[Spark-Avro] Question related to the Avro data generated by Spark-Avro

2015-11-16 Thread java8964
Hi, I have one question related to Spark-Avro, not sure if here is the best 
place to ask.
I have the following Scala Case class, populated with the data in the Spark 
application, and I tried to save it as AVRO format in the HDFS
case class Claim(..)
case class Coupon(account_id: Long, claims: List[Claim])
As shown in the example above, the Coupon case class contains a List of Claim objects.
The RDD holds an Iterator of Coupon data, which I try to save into HDFS. I am using Spark 1.3.1 with Spark-Avro 1.0.0 (which matches Spark 1.3.x):
rdd.toDF.save("hdfs_location", "com.databricks.spark.avro")
Saving the data this way works fine, but the problem is that I cannot use the Avro data in Hive.
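For completeness, here is a minimal self-contained sketch of that write path (Spark 1.3.x with Spark-Avro 1.0.0; the Claim fields are invented for illustration, since they are elided above):

// Hypothetical Claim fields, just so the sketch compiles; the real class has more fields.
case class Claim(claim_id: Long, amount: Double)
case class Coupon(account_id: Long, claims: List[Claim])

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Coupon(1L, List(Claim(10L, 5.0)))))
// Spark-Avro derives the Avro schema from the DataFrame schema at write time.
rdd.toDF().save("hdfs_location", "com.databricks.spark.avro")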
Here is the schema example generated by Spark-Avro for the above data:
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "account_id",
      "type": "long"
    },
    {
      "name": "claims",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "claims",
              "fields": [
                ..
The claims field is generated as a union containing an array, instead of an array of structs directly. To make it clearer, here is the schema in Hive when pointing to the data generated by Spark-Avro:

desc table
OK
col_name     data_type                        comment
account_id   bigint                           from deserializer
...
claims       uniontype<array<struct<...>>>    from deserializer

Obviously, this causes trouble for Hive when querying this data (at least in Hive 0.12, which we currently use), so the end user cannot query it in Hive like "select claims[0].account_id from table".
I wonder why Spark-Avro has to wrap a union structure in this case, instead of just building "array<struct>"? Or better, is there a way I can control the Avro generated in this case by Spark-Avro?
Thanks
Yong
  

RE: In Spark application, how to get the passed in configuration?

2015-11-12 Thread java8964
Thanks. It looks like the config has to start with "spark", a very interesting requirement.
I am using Spark 1.3.1; I didn't see this warning log in the console.

Thanks for your help.
Yong
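For the archive, a minimal sketch of the working pattern discussed below (the key name spark.runtime.environment is only an example):

// Submit with a "spark."-prefixed key, e.g.:
//   spark-submit --class myCode --conf spark.runtime.environment=passInValue my.jar
import org.apache.spark.SparkContext

val sparkContext = new SparkContext()
// Non-"spark." keys are dropped by spark-submit with a warning; prefixed keys show up here.
val runtimeEnvironment = sparkContext.getConf.get("spark.runtime.environment", "default")
println("load properties from runtimeEnvironment: " + runtimeEnvironment)
sparkContext.stop()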
Date: Thu, 12 Nov 2015 23:03:12 +0530
Subject: Re: In Spark application, how to get the passed in configuration?
From: varunsharman...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org

You must be getting a warning at the start of application like : Warning: 
Ignoring non-spark config property: runtime.environment=passInValue .

Configs in Spark should start with "spark" as a prefix. So try something like --conf spark.runtime.environment=passInValue.
Regards
Varun
On Thu, Nov 12, 2015 at 9:51 PM, java8964 <java8...@hotmail.com> wrote:



In my Spark application, I want to access the passed-in configuration, but it doesn't work. How should I do that?
object myCode extends Logging {
  // starting point of the application
  def main(args: Array[String]): Unit = {
    val sparkContext = new SparkContext()
    val runtimeEnvironment = sparkContext.getConf.get("runtime.environment", "default")
    Console.println("load properties from runtimeEnvironment: " + runtimeEnvironment)
    logInfo("load properties from runtimeEnvironment: " + runtimeEnvironment)
    sparkContext.stop()
  }
}
/opt/spark/bin/spark-submit --class myCode --conf runtime.environment=passInValue my.jar

load properties from runtimeEnvironment: default

It looks like I cannot access the dynamically passed-in value from the command line this way. In Hadoop, the Configuration object includes all the passed-in key/value pairs in the application. How do I achieve that in Spark?
Thanks
Yong


-- 
VARUN SHARMA
Flipkart
Bangalore

  

In Spark application, how to get the passed in configuration?

2015-11-12 Thread java8964
In my Spark application, I want to access the passed-in configuration, but it doesn't work. How should I do that?
object myCode extends Logging {
  // starting point of the application
  def main(args: Array[String]): Unit = {
    val sparkContext = new SparkContext()
    val runtimeEnvironment = sparkContext.getConf.get("runtime.environment", "default")
    Console.println("load properties from runtimeEnvironment: " + runtimeEnvironment)
    logInfo("load properties from runtimeEnvironment: " + runtimeEnvironment)
    sparkContext.stop()
  }
}
/opt/spark/bin/spark-submit --class myCode --conf runtime.environment=passInValue my.jar

load properties from runtimeEnvironment: default

It looks like I cannot access the dynamically passed-in value from the command line this way. In Hadoop, the Configuration object includes all the passed-in key/value pairs in the application. How do I achieve that in Spark?
Thanks
Yong

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread java8964
Any reason that Spark Cassandra connector won't work for you?
Yong

To: bryan.jeff...@gmail.com; user@spark.apache.org
From: bryan.jeff...@gmail.com
Subject: RE: Cassandra via SparkSQL/Hive JDBC
Date: Tue, 10 Nov 2015 22:42:13 -0500

Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan Jeffrey

From: Bryan Jeffrey
Sent: 11/4/2015 11:16 AM
To: user
Subject: Cassandra via SparkSQL/Hive JDBC

Hello.
I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.  
In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.
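For reference, a rough sketch of that flow using the connector's data source API (keyspace/table names are placeholders, and this is untested here):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load the Cassandra table through the connector's data source.
val df = sqlContext.load(
  "org.apache.spark.sql.cassandra",
  Map("c_table" -> "table_name", "keyspace" -> "keyspace_name"))
// Register it so SparkSQL queries on this context can see it; it would need to be
// re-registered or refreshed if the underlying data should be re-read.
df.registerTempTable("cassandra_table")
sqlContext.sql("select count(*) from cassandra_table").show()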
Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?
Regards,
Bryan Jeffrey 

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
Couldn't you use a case statement to generate a virtual column (like partition_num), then use analytic SQL partitioned by this virtual column?
In this case, the full dataset will only be scanned once.

Yong

Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: user@spark.apache.org

Hi,
I have a pretty large data set (2M entities) in my RDD. The data has already been partitioned by a specific key, and the key has a range (long type). Now I want to create a bunch of key buckets; for example, if the key has range 1 -> 100, I will break the whole range into the buckets below:
1 -> 10
11 -> 20
...
90 -> 100
I want to run some analytic SQL functions over the data owned by each key range, so I came up with 2 approaches:
1) Run RDD's filter() on the full data set RDD; the filter will create the RDD corresponding to each key bucket, and with each RDD I can create a DataFrame and run the SQL.

2) Create a DataFrame for the whole RDD, and use a bunch of SQLs to do my job:
SELECT * from  where key>=key1 AND key 

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
You can do the SQL like the following:
select *, case when key >= 1 and key <= 10 then 1 when key >= 11 and key <= 20 then 2 ... else 10 end as bucket_id from your_table
See the conditional function "case" in Hive.
Once you have the "bucket_id" column, you can run whatever analytic function you want.
Yong
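A sketch of the whole idea through HiveContext (table and column names are placeholders, and only two CASE branches are shown):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// Tag each row with a virtual bucket_id via CASE, then run analytic SQL
// partitioned by that column, so the data is only scanned once.
val bucketed = hc.sql("""
  select *,
         case when key between  1 and 10 then 1
              when key between 11 and 20 then 2
              else 10 end as bucket_id
  from your_table""")
bucketed.registerTempTable("bucketed")

// Example analytic query; replace sum(value) with whatever window function is needed.
hc.sql("""
  select bucket_id, key,
         sum(value) over (partition by bucket_id) as bucket_total
  from bucketed""").show()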

Date: Thu, 29 Oct 2015 12:53:35 -0700
Subject: Re: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org

Thanks Yong for your response.
Let me see if I can understand what you're suggesting. So for the whole data set, when I load it into Spark (I'm using a custom Hadoop InputFormat), I will add an extra field to each element in the RDD, like bucket_id.
For example:
Key 1 - 10:    bucket_id = 1
Key 11 - 20:   bucket_id = 2
...
Key 90 - 100:  bucket_id = 10
Then I can re-partition the RDD with a partitioner that will put all records with the same bucket_id in the same partition. After I get a DataFrame from the RDD, the partitioning is still preserved (is that correct?).
Then the rest of the work is just issuing SQL queries like:
SELECT * from XXX where bucket_id=1
SELECT * from XXX where bucket_id=2
...
Am I right?
Thanks
Anfernee
On Thu, Oct 29, 2015 at 11:07 AM, java8964 <java8...@hotmail.com> wrote:



Couldn't you use a case statement to generate a virtual column (like partition_num), then use analytic SQL partitioned by this virtual column?
In this case, the full dataset will only be scanned once.

Yong

Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: user@spark.apache.org

Hi,
I have a pretty large data set (2M entities) in my RDD. The data has already been partitioned by a specific key, and the key has a range (long type). Now I want to create a bunch of key buckets; for example, if the key has range 1 -> 100, I will break the whole range into the buckets below:
1 -> 10
11 -> 20
...
90 -> 100
I want to run some analytic SQL functions over the data owned by each key range, so I came up with 2 approaches:
1) Run RDD's filter() on the full data set RDD; the filter will create the RDD corresponding to each key bucket, and with each RDD I can create a DataFrame and run the SQL.

2) Create a DataFrame for the whole RDD, and use a bunch of SQLs to do my job:
SELECT * from  where key>=key1 AND key 

RE: Problem with make-distribution.sh

2015-10-26 Thread java8964
Maybe you need the Hive part?
Yong
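If the goal is to match the official builds (which bundle the datanucleus jars for Hive support), adding the Hive profiles is probably what is missing, i.e. something like ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn -Phive -Phive-thriftserver; worth double-checking against the building-spark page linked below.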

Date: Mon, 26 Oct 2015 11:34:30 -0400
Subject: Problem with make-distribution.sh
From: yana.kadiy...@gmail.com
To: user@spark.apache.org

Hi folks, 
the Spark build instructions (http://spark.apache.org/docs/latest/building-spark.html) suggest that

./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn

should produce a distribution similar to the ones found on the "Downloads" page.
I noticed that the tgz I built using the above command does not produce the 
datanucleus jars which are included in the "boxed" spark distributions. What is 
the best-practice advice here?
I would like my distribution to match the official one as closely as possible.
Thanks

RE: Spark SQL running totals

2015-10-15 Thread java8964
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported.
So cumulative sum should work then.
Thanks
Yong
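For the archive, a sketch of the cumulative sum with an explicit window frame (column and table names follow the example further down; window functions need a HiveContext in these Spark versions):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// Running total of col_2, ordered by col_1.
hc.sql("""
  select col_1, col_2,
         sum(col_2) over (order by col_1
                          rows between unbounded preceding and current row) as col_3
  from tablea""").show()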

From: java8...@hotmail.com
To: mich...@databricks.com; deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org
Subject: RE: Spark SQL running totals
Date: Thu, 15 Oct 2015 16:24:39 -0400




Not sure the window function can work for his case.
If you do a "sum() over (partition by)", that will return a total sum per partition, instead of the cumulative sum wanted in this case.
I saw there is a "cume_dist", but no "cume_sum".
Do we really have a "cume_sum" in Spark window functions, or do I totally misunderstand "sum() over (partition by)"?
Yong

From: mich...@databricks.com
Date: Thu, 15 Oct 2015 11:51:59 -0700
Subject: Re: Spark SQL running totals
To: deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org

Check out: 
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar  
wrote:
You can do a self join of the table with itself, with the join clause being a.col1 >= b.col1:
select a.col1, a.col2, sum(b.col2) from tablea as a left outer join tablea as b on (a.col1 >= b.col1) group by a.col1, a.col2
I haven't tried it, but I can't see why it wouldn't work; doing it on the RDD might be more efficient, see
https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
On 15 October 2015 at 18:48, Stefan Panayotov  wrote:



Hi,
 
I need help with Spark SQL. I need to achieve something like the following.
If I have data like:
 
col_1  col_2
1 10
2 30
3 15
4 20
5 25
 
I need to get col_3 to be the running total of the sum of the previous rows of 
col_2, e.g.
 
col_1  col_2  col_3
1      10     10
2      30     40
3      15     55
4      20     75
5      25     100
 
Is there a way to achieve this in Spark SQL or maybe with Data frame 
transformations?
 
Thanks in advance,


Stefan Panayotov, PhD 
Home: 610-355-0919 
Cell: 610-517-5586 
email: spanayo...@msn.com 
spanayo...@outlook.com 
spanayo...@comcast.net
  




  

RE: Spark SQL running totals

2015-10-15 Thread java8964
Not sure the window function can work for his case.
If you do a "sum() over (partition by)", that will return a total sum per partition, instead of the cumulative sum wanted in this case.
I saw there is a "cume_dist", but no "cume_sum".
Do we really have a "cume_sum" in Spark window functions, or do I totally misunderstand "sum() over (partition by)"?
Yong

From: mich...@databricks.com
Date: Thu, 15 Oct 2015 11:51:59 -0700
Subject: Re: Spark SQL running totals
To: deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org

Check out: 
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar  
wrote:
You can do a self join of the table with itself, with the join clause being a.col1 >= b.col1:
select a.col1, a.col2, sum(b.col2) from tablea as a left outer join tablea as b on (a.col1 >= b.col1) group by a.col1, a.col2
I haven't tried it, but I can't see why it wouldn't work; doing it on the RDD might be more efficient, see
https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
On 15 October 2015 at 18:48, Stefan Panayotov  wrote:



Hi,
 
I need help with Spark SQL. I need to achieve something like the following.
If I have data like:
 
col_1  col_2
1 10
2 30
3 15
4 20
5 25
 
I need to get col_3 to be the running total of the sum of the previous rows of 
col_2, e.g.
 
col_1  col_2  col_3
1      10     10
2      30     40
3      15     55
4      20     75
5      25     100
 
Is there a way to achieve this in Spark SQL or maybe with Data frame 
transformations?
 
Thanks in advance,


Stefan Panayotov, PhD 
Home: 610-355-0919 
Cell: 610-517-5586 
email: spanayo...@msn.com 
spanayo...@outlook.com 
spanayo...@comcast.net
  



  

RE: Spark DataFrame GroupBy into List

2015-10-14 Thread java8964
My guess is that it is the same as the Hive UDAF collect_set.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
Yong
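A sketch of the same aggregation through HiveContext SQL (assuming the dataframe from the question below is registered as a temp table; note that collect_set also de-duplicates the ids):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
df.registerTempTable("events")   // df is the (category, id) dataframe from the question
// Hive's collect_set UDAF gathers the ids of each category into an array.
hc.sql("select category, collect_set(id) as id_list from events group by category").show()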

From: sliznmail...@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +
Subject: Re: Spark DataFrame GroupBy into List
To: mich...@databricks.com
CC: user@spark.apache.org

Hi Michael, 
Can you be more specific on `collect_set`? Is it a built-in function or, if it is a UDF, how is it defined?
BR,Todd Leo
On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust  wrote:
import org.apache.spark.sql.functions._
df.groupBy("category").agg(callUDF("collect_set", df("id")).as("id_list"))
On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu  wrote:
Hey Spark users,
I'm trying to group by a dataframe, by appending occurrences into a list 
instead of count. 
Let's say we have a dataframe as shown below:

| category | id |
|----------|:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

ideally, after some magic group by (reverse explode?):

| category | id_list |
|----------|---------|
| A        | 1,2     |
| B        | 3,4     |
| C        | 5       |
any tricks to achieve that? Scala Spark API is preferred. =D
BR,Todd Leo 





  

How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Hi,  Sparkers:
In this case, I want to use Spark as an ETL engine to load the data from 
Cassandra, and save it into HDFS.
Here is the environment specified information:
Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2
I am using the Cassandra Spark Connector 1.3.x, with which I have no problem querying the C* data in Spark. But I have a problem trying to save the data into HDFS, like below:
val df = sqlContext.load("org.apache.spark.sql.cassandra", options = Map("c_table" -> "table_name", "keyspace" -> "keyspace_name"))
df: org.apache.spark.sql.DataFrame = [account_id: bigint, campaign_id: uuid, business_info_ids: array, closed_date: timestamp, compliance_hold: boolean, contacts_list_id: uuid, contacts_list_seq: bigint, currency_type: string, deleted_date: timestamp, discount_info: map<string,string>, end_date: timestamp, insert_by: string, insert_time: timestamp, last_update_by: string, last_update_time: timestamp, name: string, parent_id: uuid, publish_date: timestamp, share_incentive: map<string,string>, start_date: timestamp, version: int]

scala> df.count
res12: Long = 757704

I can also dump the data using df.first, without any problem.
But when I try to save it:
scala> df.save("hdfs://location", "parquet")
java.lang.RuntimeException: Unsupported datatype UUIDType
	at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:372)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:316)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:315)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:395)
  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:394)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:393)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:440)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:260)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:269)
   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:269)
at 
org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:391)   at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:98)  
 at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:128) 
 at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)   
 at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)at 
org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)   at 
$iwC$$iwC$$iwC$$iwC$$iwC.(:45)at 
$iwC$$iwC$$iwC$$iwC.(:47) at 
$iwC$$iwC$$iwC.(:49)  at $iwC$$iwC.(:51)   at 
$iwC.(:53)at (:55) at .(:59)   
 at .() at .(:7) at .() at 
$print()at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)   at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)   at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)   at 

RE: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Thanks, Ted.
Does this mean I am out of luck for now? If I use HiveContext, and cast the 
UUID as string, will it work?
Yong
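For what it's worth, a sketch of the cast-to-string idea (only a few of the uuid columns are shown; this is untested, and whether the connector's UUIDType accepts the cast is exactly the open question here):

// Untested sketch: cast uuid columns to string before writing Parquet.
val casted = df.selectExpr(
  "account_id",
  "cast(campaign_id as string) as campaign_id",
  "cast(contacts_list_id as string) as contacts_list_id",
  "cast(parent_id as string) as parent_id")
  // ...remaining non-uuid columns unchanged...
casted.save("hdfs://location", "parquet")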

Date: Fri, 9 Oct 2015 09:09:38 -0700
Subject: Re: How to handle the UUID in Spark 1.3.1
From: yuzhih...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org

This is related: SPARK-10501

On Fri, Oct 9, 2015 at 7:28 AM, java8964 <java8...@hotmail.com> wrote:



Hi,  Sparkers:
In this case, I want to use Spark as an ETL engine to load the data from 
Cassandra, and save it into HDFS.
Here is the environment specified information:
Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2
I am using the Cassandra Spark Connector 1.3.x, which I have no problem to 
query the C* data in the Spark. But I have a problem trying to save the data 
into HDFS, like below:
val df = sqlContext.load("org.apache.spark.sql.cassandra", options = Map( 
"c_table" -> "table_name", "keyspace" -> "keyspace_name")df: 
org.apache.spark.sql.DataFrame = [account_id: bigint, campaign_id: uuid, 
business_info_ids: array, closed_date: timestamp, compliance_hold: 
boolean, contacts_list_id: uuid, contacts_list_seq: bigint, currency_type: 
string, deleted_date: timestamp, discount_info: map<string,string>, end_date: 
timestamp, insert_by: string, insert_time: timestamp, last_update_by: string, 
last_update_time: timestamp, name: string, parent_id: uuid, publish_date: 
timestamp, share_incentive: map<string,string>, start_date: timestamp, version: 
int]
scala> df.count
res12: Long = 757704
I can also dump the data using df.first, without any problem.
But when I try to save it:
scala> df.save("hdfs://location", "parquet")
java.lang.RuntimeException: Unsupported datatype UUIDType
	at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:372)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:316)
 at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:315)
 at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:395)
  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$4.apply(ParquetTypes.scala:394)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypes.scala:393)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:440)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:260)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:269)
   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.AbstractTraversable.map(Traversable.scala:105)  at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:269)
at 
org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:391)   at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:98)  
 at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:128) 
 at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:240)   
 at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)at 
org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
 at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)   at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)  at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)   at 
$iwC$$iwC$$iwC$$iwC$$iwC.(:45)at 
$iwC$$iwC$$iwC$$iwC.(:47) at 
$iwC$$iwC$$iwC.(:49)  at $iwC$$iwC.(:51)   at 
$iwC.(:53)at (:55) at .(:59)   
 at .() at .(:7) at .() at 
$print()at sun.reflect.NativeMethodAccessorImpl.invok

RE: Building RDD for a Custom MPP Database

2015-10-05 Thread java8964
You want to implement a custom InputFormat for your MPP, which can provide the 
location preference information to Spark.
Yong
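A rough sketch in Scala of the piece that carries the location preference (everything here is illustrative; a matching InputFormat/RecordReader that actually reads the shard is still needed, and the RDD would then be created with sc.newAPIHadoopRDD):

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapreduce.InputSplit

// One split per MPP shard; getLocations is what Spark reads as the locality hint.
class MppShardSplit(var shardId: Int, var host: String)
    extends InputSplit with Writable {
  def this() = this(0, "")                        // no-arg constructor for Hadoop deserialization
  override def getLength: Long = 0L               // size unknown up front; fine for scheduling
  override def getLocations: Array[String] = Array(host)
  override def write(out: DataOutput): Unit = { out.writeInt(shardId); out.writeUTF(host) }
  override def readFields(in: DataInput): Unit = { shardId = in.readInt(); host = in.readUTF() }
}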

> Date: Mon, 5 Oct 2015 10:53:27 -0700
> From: vjan...@sankia.com
> To: user@spark.apache.org
> Subject: Building RDD for a Custom MPP Database
> 
> Hi
> I have to build a RDD for a custom MPP database, which is shared across
> several nodes. I would like to do this using Java; Can I extend the JavaRDD
> and override the specific methods? Also, if can I override the
> getlocationPreferences methods as well? Is there any other alternatives,
> where I can leverage existing RDD?
> 
> Any pointers appreciated 
> 
> Thanks
> VJ
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Building-RDD-for-a-Custom-MPP-Database-tp24934.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

RE: Problem understanding spark word count execution

2015-10-02 Thread java8964
No problem.
From the mapper side, Spark is very similar to MapReduce; but on the reducer fetching side, MR uses a sort-merge while Spark uses a HashMap.
So keep in mind that you get data automatically sorted on the reducer side in MR, but not in Spark.
Spark's performance comes from its caching ability and its smart arrangement of tasks into stages. Intermediate data between stages is never stored in HDFS, only on local disk; in MR, the intermediate data from one MR job to the next is stored in HDFS. Spark also uses threads to run tasks, instead of heavyweight processes as MR does.
Without caching, in my experience, Spark can be about 2x to 5x faster than an MR job, depending on the job logic. If the data volume is small, Spark will be even better, as a process is way more expensive than a thread in this case.
I didn't see your Spark script, so my guess is that you are using "rdd.collect()", which will transfer the final result to the driver and dump it in the console.
Yong
Date: Fri, 2 Oct 2015 00:50:24 -0700
Subject: Re: Problem understanding spark word count execution
From: kar...@bluedata.com
To: java8...@hotmail.com
CC: nicolae.maras...@adswizz.com; user@spark.apache.org

Thanks Yong , 
That was a good explanation I was looking for , however I have one doubt , you 
write - "Image that you have 2 mappers to read the data, then each mapper will 
generate the (word, count) tuple output in segments. Spark always output that 
in local file. (In fact, one file with different segments to represent 
different partitions) "  if this is true then spark is very similar to Hadoop 
MapReduce (Disk IO bw phases) , with so many IOs after each stage how does 
spark achieves the performance that it does as compared to map reduce . Another 
doubt is  "The 2000 bytes sent to driver is the final output aggregated on the 
reducers end, and merged back to the driver." , which part of our word count 
code takes care of this part ? And yes there are only 273 distinct words in the 
text so that's not a surprise.
Thanks again,
Hope to get a reply.
--Kartik
On Thu, Oct 1, 2015 at 5:49 PM, java8964 <java8...@hotmail.com> wrote:



I am not sure about the original explanation of shuffle write.
In the word count example, the shuffle is needed, as Spark has to group by the word (reduceByKey is more accurate here). Imagine that you have 2 mappers to read the data; then each mapper will generate the (word, count) tuple output in segments. Spark always outputs that to a local file (in fact, one file with different segments to represent different partitions).
As you can imagine, the output of these segments will be small, as it only contains (word, count of word) tuples. After each mapper generates this segmented file for different partitions, the reducer will fetch the partitions belonging to itself.
In your job summary, if your source is a text file, then your data corresponds to 2 HDFS blocks, or 2x256M. There are 2 tasks concurrently reading these 2 partitions, with about 2.5M lines of data in each partition being processed.
The output of each partition is a shuffle write of 2.7K of data, which is the size of the segment file generated, corresponding to all the unique words and their counts for this partition. So the size is reasonable, at least to me.
The interesting number is 273 as the shuffle write records. I am not 100% sure of its meaning. Does it mean that this partition has 273 unique words from these 2.5M lines of data? That is kind of low, but I really don't have another explanation of its meaning.
If your final output shows hundreds of unique words, then it does.
The 2000 bytes sent to the driver is the final output aggregated on the reducers' end, and merged back to the driver.
Yong

Date: Thu, 1 Oct 2015 13:33:59 -0700
Subject: Re: Problem understanding spark word count execution
From: kar...@bluedata.com
To: nicolae.maras...@adswizz.com
CC: user@spark.apache.org

Hi Nicolae,
Thanks for the reply. To further clarify things:
sc.textFile is reading from HDFS. Now shouldn't the file be read in a way such that EACH executor works on only the local copy of the file part available? In this case it is a ~4.64 GB file and the block size is 256MB, so approx 19 partitions will be created and each task will run on 1 partition (which is what I am seeing in the stage logs). Also I assume it will read the file in a way that each executor will have exactly the same amount of data, so there shouldn't be any shuffling in reading at least.
During stage 0 (sc.textFile -> flatMap -> Map), for every task this is the output I am seeing:

Index | ID | Attempt | Status  | Locality Level | Executor ID / Host | Launch Time         | Duration | GC Time | Input Size / Records        | Shuffle Write Size / Records
0     | 44 | 0       | SUCCESS | NODE_LOCAL     | 1 / 10.35.244.10   | 2015/09/29 13:57:24 | 14 s     | 0.2 s   | 256.0 MB (hadoop) / 2595161 | 2.7 KB / 273
1     | 45 | 0       | SUCCESS | NODE_LOCAL     | 2 / 10.35.244.11   | 2015/09/29 13:57:24 | 13 s     | 0.2 s   | 256.0 MB (hadoop) / 2595176 | 2.7 KB / 273

I have the following questions:
1) What exactly is 2.7KB

RE: Problem understanding spark word count execution

2015-10-02 Thread java8964
These parameters in fact control the behavior on the reduce side, as in your word count example.
The partitions will be fetched by the reducer to which they are assigned. The reducer will fetch the corresponding partitions from the different mappers' output, and it will process the data based on your logic while fetching them. This memory area is a sort-buffer area, and depending on "spark.shuffle.spill" (memory only vs memory + disk), Spark will use different implementations (AppendOnlyMap or ExternalAppendOnlyMap) to handle it.
The spark.shuffle.memoryFraction setting controls what fraction of the Java heap to use as that sort-buffer area.
You can find more information in this Jira:
https://issues.apache.org/jira/browse/SPARK-2045
Yong
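For reference, where those knobs live (the values below are only examples; the Spark 1.x defaults are 0.2 and true respectively):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("wordcount")
  // Fraction of the heap used for the shuffle's in-memory sort buffer.
  .set("spark.shuffle.memoryFraction", "0.3")
  // Spill to local disk once that buffer fills, instead of failing with OOM.
  .set("spark.shuffle.spill", "true")
val sc = new SparkContext(conf)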

Date: Fri, 2 Oct 2015 11:55:41 -0700
Subject: Re: Problem understanding spark word count execution
From: kar...@bluedata.com
To: java8...@hotmail.com
CC: nicolae.maras...@adswizz.com; user@spark.apache.org

Thanks Yong,
My script is pretty straightforward:
sc.textFile("/wc/input").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).saveAsTextFile("/wc/out2") // both paths are HDFS
So if every shuffle write always writes to disk, what is the meaning of these properties?
spark.shuffle.memoryFraction
spark.shuffle.spill

Thanks,
Kartik











On Fri, Oct 2, 2015 at 6:22 AM, java8964 <java8...@hotmail.com> wrote:



No problem.
From the mapper side, Spark is very similar to MapReduce; but on the reducer fetching side, MR uses a sort-merge while Spark uses a HashMap.
So keep in mind that you get data automatically sorted on the reducer side in MR, but not in Spark.
Spark's performance comes from its caching ability and its smart arrangement of tasks into stages. Intermediate data between stages is never stored in HDFS, only on local disk; in MR, the intermediate data from one MR job to the next is stored in HDFS. Spark also uses threads to run tasks, instead of heavyweight processes as MR does.
Without caching, in my experience, Spark can be about 2x to 5x faster than an MR job, depending on the job logic. If the data volume is small, Spark will be even better, as a process is way more expensive than a thread in this case.
I didn't see your Spark script, so my guess is that you are using "rdd.collect()", which will transfer the final result to the driver and dump it in the console.
Yong
Date: Fri, 2 Oct 2015 00:50:24 -0700
Subject: Re: Problem understanding spark word count execution
From: kar...@bluedata.com
To: java8...@hotmail.com
CC: nicolae.maras...@adswizz.com; user@spark.apache.org

Thanks Yong , 
That was a good explanation I was looking for , however I have one doubt , you 
write - "Image that you have 2 mappers to read the data, then each mapper will 
generate the (word, count) tuple output in segments. Spark always output that 
in local file. (In fact, one file with different segments to represent 
different partitions) "  if this is true then spark is very similar to Hadoop 
MapReduce (Disk IO bw phases) , with so many IOs after each stage how does 
spark achieves the performance that it does as compared to map reduce . Another 
doubt is  "The 2000 bytes sent to driver is the final output aggregated on the 
reducers end, and merged back to the driver." , which part of our word count 
code takes care of this part ? And yes there are only 273 distinct words in the 
text so that's not a surprise.
Thanks again,
Hope to get a reply.
--Kartik
On Thu, Oct 1, 2015 at 5:49 PM, java8964 <java8...@hotmail.com> wrote:



I am not sure about the original explanation of shuffle write.
In the word count example, the shuffle is needed, as Spark has to group by the word (reduceByKey is more accurate here). Imagine that you have 2 mappers to read the data; then each mapper will generate the (word, count) tuple output in segments. Spark always outputs that to a local file (in fact, one file with different segments to represent different partitions).
As you can imagine, the output of these segments will be small, as it only contains (word, count of word) tuples. After each mapper generates this segmented file for different partitions, the reducer will fetch the partitions belonging to itself.
In your job summary, if your source is a text file, then your data corresponds to 2 HDFS blocks, or 2x256M. There are 2 tasks concurrently reading these 2 partitions, with about 2.5M lines of data in each partition being processed.
The output of each partition is a shuffle write of 2.7K of data, which is the size of the segment file generated, corresponding to all the unique words and their counts for this partition. So the size is reasonable, at least to me.
The interesting number is 273 as the shuffle write records. I am not 100% sure of its meaning. Does it mean that this partition has 273 unique words from these 2.5M lines of data? That is kind of low, but I really don't have another explanation of its meaning.
I

RE: Problem understanding spark word count execution

2015-10-01 Thread java8964
I am not sure about the original explanation of shuffle write.
In the word count example, the shuffle is needed, as Spark has to group by the word (reduceByKey is more accurate here). Imagine that you have 2 mappers to read the data; then each mapper will generate the (word, count) tuple output in segments. Spark always outputs that to a local file (in fact, one file with different segments to represent different partitions).
As you can imagine, the output of these segments will be small, as it only contains (word, count of word) tuples. After each mapper generates this segmented file for different partitions, the reducer will fetch the partitions belonging to itself.
In your job summary, if your source is a text file, then your data corresponds to 2 HDFS blocks, or 2x256M. There are 2 tasks concurrently reading these 2 partitions, with about 2.5M lines of data in each partition being processed.
The output of each partition is a shuffle write of 2.7K of data, which is the size of the segment file generated, corresponding to all the unique words and their counts for this partition. So the size is reasonable, at least to me.
The interesting number is 273 as the shuffle write records. I am not 100% sure of its meaning. Does it mean that this partition has 273 unique words from these 2.5M lines of data? That is kind of low, but I really don't have another explanation of its meaning.
If your final output shows hundreds of unique words, then it does.
The 2000 bytes sent to the driver is the final output aggregated on the reducers' end, and merged back to the driver.
Yong

Date: Thu, 1 Oct 2015 13:33:59 -0700
Subject: Re: Problem understanding spark word count execution
From: kar...@bluedata.com
To: nicolae.maras...@adswizz.com
CC: user@spark.apache.org

Hi Nicolae,
Thanks for the reply. To further clarify things:
sc.textFile is reading from HDFS. Now shouldn't the file be read in a way such that EACH executor works on only the local copy of the file part available? In this case it is a ~4.64 GB file and the block size is 256MB, so approx 19 partitions will be created and each task will run on 1 partition (which is what I am seeing in the stage logs). Also I assume it will read the file in a way that each executor will have exactly the same amount of data, so there shouldn't be any shuffling in reading at least.
During stage 0 (sc.textFile -> flatMap -> Map), for every task this is the output I am seeing:

Index | ID | Attempt | Status  | Locality Level | Executor ID / Host | Launch Time         | Duration | GC Time | Input Size / Records        | Shuffle Write Size / Records
0     | 44 | 0       | SUCCESS | NODE_LOCAL     | 1 / 10.35.244.10   | 2015/09/29 13:57:24 | 14 s     | 0.2 s   | 256.0 MB (hadoop) / 2595161 | 2.7 KB / 273
1     | 45 | 0       | SUCCESS | NODE_LOCAL     | 2 / 10.35.244.11   | 2015/09/29 13:57:24 | 13 s     | 0.2 s   | 256.0 MB (hadoop) / 2595176 | 2.7 KB / 273

I have the following questions:
1) What exactly is the 2.7KB of shuffle write?
2) Is this 2.7 KB of shuffle write local to that executor?
3) In the executor logs I am seeing 2000 bytes of results sent to the driver; if instead this number were much, much greater than 2000 bytes, such that it does not fit in the executor's memory, would the shuffle write increase?
4) For word count, 256 MB of data is a substantial amount of text; how come the result for this stage is only 2000 bytes!! It should send every word with its respective count; for a 256 MB input this result should be much bigger?
I hope I am clear this time.
Hope to get a reply,
Thanks
Kartik


On Thu, Oct 1, 2015 at 12:38 PM, Nicolae Marasoiu wrote:

Hi,

So you say "sc.textFile -> flatMap -> Map".

My understanding is like this:
First, a number of partitions are determined, p of them. You can give a hint on this.
Then the nodes which will load the partitions, that is n nodes (where n <= p).

Relatively at the same time or not, the n nodes start opening different sections of the file, the physical equivalent of the partitions: for instance in HDFS they would do an open and a seek, I guess, and just read from the stream there, converting to whatever the InputFormat dictates.

The shuffle can only be the part where a node opens an HDFS file, for instance, but the node does not have a local replica of the blocks which it needs to read (those pertaining to its assigned partitions). So it needs to pick them up from remote nodes which do have replicas of that data.

After blocks are read into memory, flatMap and Map are local computations generating new RDDs, and in the end the result is sent to the driver (whatever the terminating computation does on the RDD, like the result of reduce, or side effects of rdd.foreach, etc).

Maybe you can share more of your context if it is still unclear.
I just made assumptions to give clarity on a similar thing.

Nicu

From: Kartik Mathur
Sent: Thursday, October 1, 2015 10:25 PM
To: Nicolae Marasoiu
Cc: user
Subject: Re: Problem understanding spark word count execution

Thanks Nicolae,
So in my case all executors are sending results back to the driver

RE: Setting executors per worker - Standalone

2015-09-29 Thread java8964
I don't think you can do that in the Standalone mode before 1.5.
The best you can do is to have multiple workers per box. One worker can and will only start one executor, before Spark 1.5.
What you can do is to set "SPARK_WORKER_INSTANCES", which controls how many worker instances you can start per box.
Yong 
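For the archive, a sketch of that setup in conf/spark-env.sh on each box (the 4-core machines are described in the quoted mail below; the memory figure is only an example):

SPARK_WORKER_INSTANCES=4   # four workers per box, hence four executors per box pre-1.5
SPARK_WORKER_CORES=1       # one core per worker
SPARK_WORKER_MEMORY=2g     # per-worker memory, sized so all four workers fit on the box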

Date: Mon, 28 Sep 2015 20:47:18 -0700
Subject: Re: Setting executors per worker - Standalone
From: james.p...@gmail.com
To: zjf...@gmail.com
CC: user@spark.apache.org

Thanks for your reply.
Setting it as 
--conf spark.executor.cores=1 
when I start spark-shell (as an example application) indeed sets the number of cores per executor to 1 (it was 4 before), but I still have 1 executor per worker. What I am really looking for is having 1 worker with 4 executors (each with one core) per machine when I run my application. Based on the documentation it seems feasible, but it is not clear how.
Thanks.
On Mon, Sep 28, 2015 at 8:46 PM, Jeff Zhang  wrote:
use "--executor-cores 1" you will get 4 executors per worker since you have 4 
cores per worker


On Tue, Sep 29, 2015 at 8:24 AM, James Pirz  wrote:
Hi,
I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes, where each machine has 12GB of RAM and 4 cores. On each machine I have one worker which is running one executor that grabs all 4 cores. I am interested in checking the performance with "one worker but 4 executors per machine, each with one core".
I can see that "running multiple executors per worker in Standalone mode" is 
possible based on the closed issue:
https://issues.apache.org/jira/browse/SPARK-1706

But I can not find a way to do that. "SPARK_EXECUTOR_INSTANCES" is only 
available for the Yarn mode, and in the standalone mode I can just set 
"SPARK_WORKER_INSTANCES" and "SPARK_WORKER_CORES" and "SPARK_WORKER_MEMORY".
Any hint or suggestion would be great.



-- 
Best Regards

Jeff Zhang


  

RE: nested collection object query

2015-09-29 Thread java8964
You have 2 options:
Option 1:
Use lateral view explode, as you did below. But if you want to remove the duplicates, then use distinct after that.
For example, a row of (col1, col2, ArrayOf(Struct)) after explode becomes:
col1, col2, employee0
col1, col2, employee1
col1, col2, employee0
Then: select distinct col1, col2 from ... where emp.name = 'employee0'
Option 2: Implement your own UDF to do the logic you want. In fact, Hive already has one called array_contains(), which checks whether the array contains the data you want. But in your case, the data in the array is a struct, and you only want to compare the name field of the struct, instead of the whole struct. You would need to override the equals() logic of array_contains() in Hive, so you have to implement that as a custom UDF.
See the hive function of array_contains here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CollectionFunctions
Yong
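A sketch of option 1 end to end through HiveContext (schema and names taken from the question below):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// Explode the employee array, filter on the struct's name field,
// then distinct to collapse parents that matched more than one element.
hc.sql("""
  select distinct id
  from department lateral view explode(employee) dummy_table as emp
  where emp.name = 'employee0'""").show()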
From: tridib.sama...@live.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 23:02:41 -0700




Well, I figured out a way to use explode. But it returns two rows if there are two matches in the nested array objects.

select id from department LATERAL VIEW explode(employee) dummy_table as emp where emp.name = 'employee0'

I was looking for an operator that loops through the array, returns true if any element matches the condition, and returns the parent object.
From: tridib.sama...@live.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 22:26:46 -0700




Thanks for your response, Yong! The array syntax works fine. But I am not sure how to use explode. Should I use it as follows?
select id from department where explode(employee).name = 'employee0'

This query gives me java.lang.UnsupportedOperationException. I am using HiveContext.
 
From: java8...@hotmail.com
To: tridib.sama...@live.com; user@spark.apache.org
Subject: RE: nested collection object query
Date: Mon, 28 Sep 2015 20:42:11 -0400




Your employee field in fact is an array of struct, not just a struct.
If you are using HiveContext, then you can refer to it like the following:
select id from member where employee[0].name = 'employee0'
Here employee[0] points to the 1st element of the array.
If you want to query all the elements in the array, then you have to use "explode" in Hive.
See the Hive documentation for this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
Yong

> Date: Mon, 28 Sep 2015 16:37:23 -0700
> From: tridib.sama...@live.com
> To: user@spark.apache.org
> Subject: nested collection object query
> 
> Hi Friends,
> What is the right syntax to query on collection of nested object? I have a
> following schema and SQL. But it does not return anything. Is the syntax
> correct?
> 
> root
>  |-- id: string (nullable = false)
>  |-- employee: array (nullable = false)
>  ||-- element: struct (containsNull = true)
>  |||-- id: string (nullable = false)
>  |||-- name: string (nullable = false)
>  |||-- speciality: string (nullable = false)
> 
> 
> select id from member where employee.name = 'employee0'
> 
> Uploaded a test if some one want to try it out. NestedObjectTest.java
> 
>   
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/nested-collection-object-query-tp24853.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


  

RE: nested collection object query

2015-09-28 Thread java8964
Your employee field in fact is an array of struct, not just a struct.
If you are using HiveContext, then you can refer to it like the following:
select id from member where employee[0].name = 'employee0'
Here employee[0] points to the 1st element of the array.
If you want to query all the elements in the array, then you have to use "explode" in Hive.
See the Hive documentation for this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
Yong

> Date: Mon, 28 Sep 2015 16:37:23 -0700
> From: tridib.sama...@live.com
> To: user@spark.apache.org
> Subject: nested collection object query
> 
> Hi Friends,
> What is the right syntax to query on collection of nested object? I have a
> following schema and SQL. But it does not return anything. Is the syntax
> correct?
> 
> root
>  |-- id: string (nullable = false)
>  |-- employee: array (nullable = false)
>  ||-- element: struct (containsNull = true)
>  |||-- id: string (nullable = false)
>  |||-- name: string (nullable = false)
>  |||-- speciality: string (nullable = false)
> 
> 
> select id from member where employee.name = 'employee0'
> 
> Uploaded a test if some one want to try it out. NestedObjectTest.java
> 
>   
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/nested-collection-object-query-tp24853.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

RE: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-28 Thread java8964
Hi, Lian:
Thanks for the information. It works as expected in Spark with this setting.
Yong
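For the archive, the setting in context (a minimal sketch for Spark 1.3.x):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Treat Parquet BINARY columns without a UTF8 annotation as strings,
// which is what Hive-written Parquet needs here.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
v_event_cnt.printSchema()   // column2 now shows up as string instead of binary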

Subject: Re: Is this a Spark issue or Hive issue that Spark cannot read the 
string type data in the Parquet generated by Hive
To: java8...@hotmail.com; user@spark.apache.org
From: lian.cs@gmail.com
Date: Fri, 25 Sep 2015 14:42:55 -0700


  

  
  
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive.

This is actually a bug of parquet-hive. When generating a Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like:

message hive_schema {
  ...
  optional binary column2 (UTF8);
  ...
}

but parquet-hive fails to add it, and produces:

message hive_schema {
  ...
  optional binary column2;
  ...
}

Thus binary fields and string fields are made indistinguishable.

Interestingly, there's another bug in parquet-thrift, which always adds the UTF8 annotation to all binary fields :)

Cheng

On 9/25/15 2:03 PM, java8964 wrote:



  
Hi, Spark Users:

I have a problem related to Spark not recognizing the string type in the Parquet schema generated by Hive.

Version of all components:
Spark 1.3.1
Hive 0.12.0
Parquet 1.3.2

I generated a detail low-level table in the Parquet format using MapReduce Java code. This table can be read in Hive and Spark without any issue.

Now I create a Hive aggregation table like the following:

create external table T (
  column1 bigint,
  column2 string,
  ..
)
partitioned by (dt string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
location '/hdfs_location'

Then the table is populated in Hive by:

set hive.exec.compress.output=true;
set parquet.compression=snappy;

insert into table T partition(dt='2015-09-23')
select .
from Detail_Table
group by 

After this, we can query the T table in Hive without issue.

But if I try to use it in Spark 1.3.1 like the following:

import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")

scala> v_event_cnt.printSchema
root
 |-- column1: long (nullable = true)
 |-- column2: binary (nullable = true)
 |-- 
 |-- dt: string (nullable = true)

Spark recognizes column2 as binary type, instead of string type in this case, but in Hive it works fine. This brings an issue in that, in Spark, the data will be dumped as "[B@e353d68". To use it in Spark, I have to cast it as string to get the correct value out of it.

I wonder which part causes this mismatched type in the Parquet file? Does Hive not generate the correct Parquet file with schema, or can Spark in fact not recognize it due to a problem on its side?

Is there a way I can make either Hive or Spark handle this Parquet schema correctly on both ends?

Thanks

Yong
  


  

Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread java8964
Hi, Spark Users:
I have a problem related to Spark not recognizing the string type in the Parquet schema generated by Hive.
Version of all components:
Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2
I generated a detail low-level table in the Parquet format using MapReduce Java code. This table can be read in Hive and Spark without any issue.
Now I create a Hive aggregation table like the following:
create external table T (
  column1 bigint,
  column2 string,
  ..
)
partitioned by (dt string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
location '/hdfs_location'
Then the table is populated in Hive by:
set hive.exec.compress.output=true;
set parquet.compression=snappy;
insert into table T partition(dt='2015-09-23')
select .
from Detail_Table
group by 
After this, we can query the T table in Hive without issue.
But if I try to use it in Spark 1.3.1 like the following:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
scala> v_event_cnt.printSchema
root
 |-- column1: long (nullable = true)
 |-- column2: binary (nullable = true)
 |-- 
 |-- dt: string (nullable = true)
Spark recognizes column2 as binary type, instead of string type in this case, but in Hive it works fine. This brings an issue in that, in Spark, the data will be dumped as "[B@e353d68". To use it in Spark, I have to cast it as string to get the correct value out of it.
I wonder which part causes this mismatched type in the Parquet file? Does Hive not generate the correct Parquet file with schema, or can Spark in fact not recognize it due to a problem on its side?
Is there a way I can make either Hive or Spark handle this Parquet schema correctly on both ends?
Thanks
Yong

RE: Java Heap Space Error

2015-09-24 Thread java8964
This is interesting.
So you mean that a query like
"select userid from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid"
works in your cluster?
In this case, do you also see this one task with way more data than the rest, as happened when you used the regex and concatenation?
It is hard to believe that just adding "regex" and "concatenation" would change how equally the data is distributed across partitions. In your query, the distribution across the partitions simply depends on the hash partitioner of "userid".
Can you show us the query after you add the "regex" and "concatenation"?
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 15:34:48 +0300
CC: user@spark.apache.org
To: jingyu.zh...@news.com.au; java8...@hotmail.com

@Jingyu
Yes, it works without regex and concatenation, as in the query below. So what can we understand from this? When I do it like that, shuffle read sizes are equally distributed between partitions.
val usersInputDF = sqlContext.sql(s""" |  select userid from landing where dt='2015-9' and userid != '' and userid is not null and userid is not NULL and pagetype = 'productDetail' group by userid
   """.stripMargin)
@java8964
I tried with sql.shuffle.partitions = 1 but no luck. Again, one of the partitions' shuffle read size is huge and the others are very small.

So how can I balance this shuffle read size between partitions?

On 24 Sep 2015, at 03:35, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:
Does your SQL work if it does not run a regex on strings and concatenate them, I mean just selecting the stuff without the string operations?

On 24 September 2015 at 10:11, java8964 <java8...@hotmail.com> wrote:



Try to increase partitions count, that will make each partition has less data.
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 00:32:47 +0300
CC: user@spark.apache.org
To: java8...@hotmail.com

Yes, it's possible. I use S3 as the data source. My external tables are partitioned. The task below is 193/200; the job has 2 stages, and it is the 193rd task of 200 in the 2nd stage because of sql.shuffle.partitions.
How can I avoid this situation? This is my query:
select userid,concat_ws(' ',collect_list(concat_ws(' ',if(productname is not 
NULL,lower(productname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",'
 ') inputlist from landing where dt='2015-9' and userid != '' and userid is 
not null and userid is not NULL and pagetype = 'productDetail' group by userid

On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com> wrote:
Based on your description, your job shouldn't have any shuffle then, as you just apply the regex and concatenation on the column, but there is one partition having 4.3M records to be read, vs less than 1M records for the other partitions.
Is that possible? It depends on what is the source of your data.
If there is shuffle in your query (More than 2 stages generated by your query, 
and this is my guess of what happening), then it simple means that one 
partition having way more data than the rest of partitions.
Yong
From: yu...@useinsider.com
Subject: Java Heap Space Error
Date: Wed, 23 Sep 2015 23:07:17 +0300
To: user@spark.apache.org

What can cause the issue in the attached picture? I'm running an SQL query which runs a regex on strings and concatenates them. Because of this task, my job gives a Java heap space error.

  






RE: Java Heap Space Error

2015-09-24 Thread java8964
I can understand why your first query will finish without OOM, but the new one 
will fail with OOM.
In the new query, you are asking for a groupByKey/cogroup operation, which forces all the productName + productCategory values per user id to be sent to the same reducer. This can easily blow out the reducer's memory if you have one user id with a lot of productName and productCategory values.
Keep in mind that Spark on the reducer side still uses a hash map to merge all the data coming from the different mappers, so the memory on the reduce side has to be able to hold all the productName + productCategory values for (at least) the most frequent user id. I also don't know why you want all the productName and productCategory values per user id (maybe a distinct could be enough? see the sketch below).
Imagine you have one user id showing up 1M times in your dataset, with 0.5M product names of 'A' and 0.5M product names of 'B'; your query will push all 1M 'A' and 'B' values into the same reducer and ask Spark to merge them in the hash map for that user id. This will cause the OOM.
Above all, you need to find out the max count per user id in your data, something like:
select max(cnt) from (select userid, count(*) as cnt from landing where ... group by userid) t
Your memory has to support that amount of productName and productCategory data, no matter how high your partition count is (even one as high as your unique count of user ids), if that is really what you want: to consolidate all the productName and productCategory values together without even considering removing duplicates.
Both queries should still push a similar record count per partition, but with a very different volume of data.
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 18:56:51 +0300
CC: jingyu.zh...@news.com.au; user@spark.apache.org
To: java8...@hotmail.com

Yes right, the query you wrote worked in same cluster. In this case, partitions 
were equally distributed but when i used regex and concetanations it’s not as i 
said before. Query with concetanation is below:
val usersInputDF = sqlContext.sql(
  s"""
 |  select userid,concat_ws(' ',collect_list(concat_ws(' ',if(productname 
is not 
NULL,lower(productname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",'
 ') inputlist from landing where 
dt='${dateUtil.getYear}-${dateUtil.getMonth}' and day >= '${day}' and userid != 
'' and userid is not null and userid is not NULL and pagetype = 'productDetail' 
group by userid

   """.stripMargin)

On 24 Sep 2015, at 16:52, java8964 <java8...@hotmail.com> wrote:This is 
interesting.
So you mean that query as 
"select userid from landing where dt='2015-9' and userid != '' and userid is 
not null and userid is not NULL and pagetype = 'productDetail' group by userid"
works in your cluster?
In this case, do you also see one task with way more data than the rest, as happened when you used regex and concatenation?
It is hard to believe that just adding "regex" and "concatenation" would change how evenly the data is distributed across partitions. In your query, the distribution across partitions simply depends on the hash partitioning of "userid".
Can you show us the query after you added "regex" and "concatenation"?
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 15:34:48 +0300
CC: user@spark.apache.org
To: jingyu.zh...@news.com.au; java8...@hotmail.com

@JingyuYes, it works without regex and concatenation as the query below:
So, what we can understand from this? Because when i do like that, shuffle read 
sizes are equally distributed between partitions.
val usersInputDF = sqlContext.sql(s""" |  select userid from landing 
where dt='2015-9' and userid != '' and userid is not null and userid is not 
NULL and pagetype = 'productDetail' group by userid
   """.stripMargin)
@java8964
I tried with sql.shuffle.partitions = 1 but no luck. It’s again one of the 
partitions shuffle size is huge and the others are very small.

——So how can i balance this shuffle read size between partitions?

On 24 Sep 2015, at 03:35, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote:Is you 
sql works if do not runs a regex on strings and concatenates them, I mean just 
Select the stuff without String operations?

On 24 September 2015 at 10:11, java8964 <java8...@hotmail.com> wrote:
Try to increase the partition count; that will make each partition hold less data.
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 00:32:47 +0300
CC: user@spark.apache.org
To: java8...@hotmail.com

Yes, it’s possible. I use S3 as data source. My external tables has 
partitioned. Belowed task is 193/200. Job has 2 stages and its 193. task of 200 
in 2.stage because of sql.shuffle.partitions. 
How can i avoid this situation, this is my query:
select use

RE: Java Heap Space Error

2015-09-23 Thread java8964
Try to increase the partition count; that will make each partition hold less data.
Yong

Subject: Re: Java Heap Space Error
From: yu...@useinsider.com
Date: Thu, 24 Sep 2015 00:32:47 +0300
CC: user@spark.apache.org
To: java8...@hotmail.com

Yes, it’s possible. I use S3 as the data source. My external tables are partitioned. The task below is 193 of 200; the job has 2 stages, and this is the 193rd task of 200 in the 2nd stage because of sql.shuffle.partitions.
How can I avoid this situation? This is my query:
select userid,concat_ws(' ',collect_list(concat_ws(' ',if(productname is not 
NULL,lower(productname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",'
 ') inputlist from landing where dt='2015-9' and userid != '' and userid is 
not null and userid is not NULL and pagetype = 'productDetail' group by userid

On 23 Sep 2015, at 23:55, java8964 <java8...@hotmail.com> wrote: Based on your description, your job shouldn't have any shuffle then, as you just apply regex and concatenation on the column, but there is one partition with 4.3M records to be read, vs. less than 1M records for the other partitions.
Is that possible? It depends on the source of your data.
If there is a shuffle in your query (more than 2 stages generated by your query, which is my guess of what is happening), then it simply means that one partition has way more data than the rest of the partitions.
Yong
From: yu...@useinsider.com
Subject: Java Heap Space Error
Date: Wed, 23 Sep 2015 23:07:17 +0300
To: user@spark.apache.org

What can cause the issue in the attached picture? I'm running an SQL query which runs a regex on strings and concatenates them. Because of this task, my job gives a Java heap space error.

  

RE: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread java8964
That is interesting.
I don't have any Mesos experience, but I just want to know the reason why it does that.
Yong

> Date: Wed, 23 Sep 2015 15:53:54 -0700
> Subject: Debugging too many files open exception issue in Spark shuffle
> From: dbt...@dbtsai.com
> To: user@spark.apache.org
> 
> Hi,
> 
> Recently, we ran into this notorious exception while doing large
> shuffle in mesos at Netflix. We ensure that `ulimit -n` is a very
> large number, but still have the issue.
> 
> It turns out that mesos overrides the `ulimit -n` to a small number
> causing the problem. It's very non-trivial to debug (as logging in on
> the slave gives the right ulimit - it's only in the mesos context that
> it gets overridden).
> 
> Here is the code you can run in Spark shell to get the actual allowed
> # of open files for Spark.
> 
> import sys.process._
> val p = 1 to 100
> val rdd = sc.parallelize(p, 100)
> val openFiles = rdd.map(x=> Seq("sh", "-c", "ulimit
> -n").!!.toDouble.toLong).collect
> 
> Hope this can help someone in the same situation.
> 
> Sincerely,
> 
> DB Tsai
> --
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using.
Yong

> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
> 
> Could it be that your data is skewed? Do you have variable-length column
> types?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24762.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
Your performance problem sounds like it is in the driver, which is trying to broadcast 10k files by itself, and that becomes the bottleneck.
What you want is just to transfer the data, per file, from AVRO format to another format. In MR, most likely each mapper processes one file, so you utilize the whole cluster instead of just the driver.
Not sure exactly how to help you, but to do that in Spark:
1) Avoid the broadcast from the driver and let each Spark task process one file. Maybe use something like Hadoop's NLineInputFormat over a listing of all the filenames of your data, so each Spark task receives the HDFS location of one file and then starts the transform logic. In this case you transform all your small files concurrently, using all the available cores of your executors (a minimal sketch is below).
2) If the above sounds too complex, you need to find a way to stop the Spark driver from broadcasting the small files. That sounds like the normal way to handle small files, but I cannot find a configuration to force Spark to disable the broadcast.
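A minimal sketch of option 1 (the directory path is hypothetical; each task reads and converts exactly one file):

import org.apache.hadoop.fs.{FileSystem, Path}
// list the small AVRO files on the driver (cheap: only the names are collected)
val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = fs.listStatus(new Path("/data/small_avro_files")).map(_.getPath.toString)
// one partition per file, so every file is converted by its own task, in parallel
sc.parallelize(paths, paths.length).foreach { p =>
  // per task: open the single AVRO file at `p` with the AVRO reader of your choice
  // and write it back out in the target format (e.g. ORC)
}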
Yong

From: daniel.ha...@veracity-group.com
Subject: Re: spark-avro takes a lot time to load thousands of files
Date: Tue, 22 Sep 2015 16:54:26 +0300
CC: user@spark.apache.org
To: jcove...@gmail.com

I Agree but it's a constraint I have to deal with.The idea is load these files 
and merge them into ORC.When using hive on Tez it takes less than a minute. 
Daniel
On 22 בספט׳ 2015, at 16:00, Jonathan Coveney  wrote:

having a file per record is pretty inefficient on almost any file system

El martes, 22 de septiembre de 2015, Daniel Haviv 
 escribió:
Hi,We are trying to load around 10k avro files (each file holds only one 
record) using spark-avro but it takes over 15 minutes to load.It seems that 
most of the work is being done at the driver where it created a broadcast 
variable for each file.
Any idea why is it behaving that way ?Thank you.Daniel

  

RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using.
Yong

> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
> 
> Could it be that your data is skewed? Do you have variable-length column
> types?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24762.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

RE: application failed on large dataset

2015-09-16 Thread java8964
Can you try for "nio", instead of "netty".
set "spark.shuffle.blockTransferService", to "nio" and give it a try.
Yong
From: z.qian...@gmail.com
Date: Wed, 16 Sep 2015 03:21:02 +
Subject: Re: application failed on large dataset
To: java8...@hotmail.com; user@spark.apache.org

Hi,   after check with the yarn logs, all the error stack looks like below:
15/09/15 19:58:23 ERROR shuffle.OneForOneBlockFetcher: Failed while starting 
block fetchesjava.io.IOException: Connection reset by peerat 
sun.nio.ch.FileDispatcherImpl.read0(Native Method)at 
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)at 
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)at 
sun.nio.ch.IOUtil.read(IOUtil.java:192)at 
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) 
   at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
It seems that some error occurs when try to fetch the block, and after 
several retries, the executor just dies with such error.And for your 
question, I did not see any executor restart during the job.PS: the 
operator I am using during that stage if rdd.glom().mapPartitions()

java8964 <java8...@hotmail.com>于2015年9月15日周二 下午11:44写道:



When you saw this error, does any executor die due to whatever error?
Do you check to see if any executor restarts during your job?
It is hard to help you just with the stack trace. You need to tell us the whole 
picture when your jobs are running.
Yong

From: qhz...@apache.org
Date: Tue, 15 Sep 2015 15:02:28 +
Subject: Re: application failed on large dataset
To: user@spark.apache.org

has anyone met the same problems?
周千昊 <qhz...@apache.org>于2015年9月14日周一 下午9:07写道:
Hi, community  I am facing a strange problem:  all executors does not 
respond, and then all of them failed with the ExecutorLostFailure.  when I 
look into yarn logs, there are full of such exception
15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocks (after 3 retries)java.io.IOException: Failed to 
connect to host/ip:portat 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
   at java.lang.Thread.run(Thread.java:745)Caused by: 
java.net.ConnectException: Connection refused: host/ip:portat 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) 
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more

  The strange thing is that, if I r

RE: application failed on large dataset

2015-09-16 Thread java8964
 06:14:36 INFO nio.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId()
15/09/16 06:14:36 ERROR nio.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId() not found
15/09/16 06:14:36 INFO nio.ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@3011c7c9
15/09/16 06:14:36 INFO nio.ConnectionManager: key already cancelled ? sun.nio.ch.selectionkeyi...@3011c7c9
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.nio.ConnectionManager.run(ConnectionManager.scala:461)
    at org.apache.spark.network.nio.ConnectionManager$$anon$7.run(ConnectionManager.scala:193)
java8964 <java8...@hotmail.com>于2015年9月16日周三 下午8:17写道:



Can you try for "nio", instead of "netty".
set "spark.shuffle.blockTransferService", to "nio" and give it a try.
Yong
From: z.qian...@gmail.com
Date: Wed, 16 Sep 2015 03:21:02 +
Subject: Re: application failed on large dataset
To: java8...@hotmail.com; user@spark.apache.org

Hi,   after check with the yarn logs, all the error stack looks like below:
15/09/15 19:58:23 ERROR shuffle.OneForOneBlockFetcher: Failed while starting 
block fetchesjava.io.IOException: Connection reset by peerat 
sun.nio.ch.FileDispatcherImpl.read0(Native Method)at 
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)at 
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)at 
sun.nio.ch.IOUtil.read(IOUtil.java:192)at 
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) 
   at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
It seems that some error occurs when try to fetch the block, and after 
several retries, the executor just dies with such error.And for your 
question, I did not see any executor restart during the job.PS: the 
operator I am using during that stage if rdd.glom().mapPartitions()

java8964 <java8...@hotmail.com>于2015年9月15日周二 下午11:44写道:



When you saw this error, does any executor die due to whatever error?
Do you check to see if any executor restarts during your job?
It is hard to help you just with the stack trace. You need to tell us the whole 
picture when your jobs are running.
Yong

From: qhz...@apache.org
Date: Tue, 15 Sep 2015 15:02:28 +
Subject: Re: application failed on large dataset
To: user@spark.apache.org

has anyone met the same problems?
周千昊 <qhz...@apache.org>于2015年9月14日周一 下午9:07写道:
Hi, community  I am facing a strange problem:  all executors does not 
respond, and then all of them failed with the ExecutorLostFailure.  when I 
look into yarn logs, there are full of such exception
15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocks (after 3 retries)java.io.IOException: Failed to 
connect to host/ip:portat 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
   at java.lang.Thread.run(Thread.java:745)Caused by: 
java.net.ConnectException: Connection refused: host/ip:portat 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io

RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
If you use Standalone mode, just start spark-shell like the following:
spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true
Yong
Date: Tue, 15 Sep 2015 09:33:40 -0500
Subject: Re: Change protobuf version or any other third party library version 
in Spark application
From: ljia...@gmail.com
To: java8...@hotmail.com
CC: ste...@hortonworks.com; user@spark.apache.org

Steve,
Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I 
also ran into "method not defined" errors. You suggest using the Maven shading strategy, but I have already built the uber jar to package all my custom classes and their dependencies, including protobuf 3. The problem is how to configure spark-shell to use my uber jar first.
java8964 -- appreciate the link and I will try the configuration. Looks 
promising. However, the "user classpath first" attribute does not apply to 
spark-shell, am I correct? 

Lan
On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com> wrote:



It is a bad idea to use the major version change of protobuf, as it most likely 
won't work.
But you really want to give it a try, set the "user classpath first", so the 
protobuf 3 coming with your jar will be used.
The setting depends on your deployment mode, check this for the parameter:
https://issues.apache.org/jira/browse/SPARK-2996
Yong

Subject: Re: Change protobuf version or any other third party library version 
in Spark application
From: ste...@hortonworks.com
To: ljia...@gmail.com
CC: user@spark.apache.org
Date: Tue, 15 Sep 2015 09:19:28 +













On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com> wrote:



Hi, there,



I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default. 
However, I would like to use Protobuf 3 in my spark application so that I can 
use some new features such as Map support.  Is there anyway to do that? 



Right now if I build a uber.jar with dependencies including protobuf 3 classes 
and pass to spark-shell through --jars option, during the execution, I got the 
error java.lang.NoSuchFieldError: unknownFields. 









protobuf is an absolute nightmare version-wise, as protoc generates 
incompatible java classes even across point versions. Hadoop 2.2+ is and will 
always be protobuf 2.5 only; that applies transitively to downstream projects  
(the great protobuf upgrade
 of 2013 was actually pushed by the HBase team, and required a co-ordinated 
change across multiple projects)








Is there anyway to use a different version of Protobuf other than the default 
one included in the Spark distribution? I guess I can generalize and extend the 
question to any third party libraries. How to deal with version conflict for 
any third
 party libraries included in the Spark distribution? 







maven shading is the strategy. Generally it is less needed, though the 
troublesome binaries are,  across the entire apache big data stack:


google protobuf
google guava
kryo

jackson



you can generally bump up the other versions, at least by point releases.   
  

  

RE: application failed on large dataset

2015-09-15 Thread java8964
When you saw this error, did any executor die due to some error?
Did you check whether any executor restarted during your job?
It is hard to help you with just the stack trace. You need to tell us the whole picture of what happens when your job runs.
Yong

From: qhz...@apache.org
Date: Tue, 15 Sep 2015 15:02:28 +
Subject: Re: application failed on large dataset
To: user@spark.apache.org

has anyone met the same problems?
周千昊 于2015年9月14日周一 下午9:07写道:
Hi, community  I am facing a strange problem:  all executors does not 
respond, and then all of them failed with the ExecutorLostFailure.  when I 
look into yarn logs, there are full of such exception
15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocks (after 3 retries)java.io.IOException: Failed to 
connect to host/ip:portat 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
   at java.lang.Thread.run(Thread.java:745)Caused by: 
java.net.ConnectException: Connection refused: host/ip:portat 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) 
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more

  The strange thing is that, if I reduce the input size, the problems just 
disappeared. I have found a similar issue in the 
mail-archive(http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3CCAOHP_tHRtuxDfWF0qmYDauPDhZ1=MAm5thdTfgAhXDN=7kq...@mail.gmail.com%3E),
 however I didn't see the solution. So I am wondering if anyone could help with 
that?
  My env is: hdp 2.2.6, spark 1.4.1, mode: yarn-client
  spark-conf:
    spark.driver.extraJavaOptions    -Dhdp.version=2.2.6.0-2800
    spark.yarn.am.extraJavaOptions   -Dhdp.version=2.2.6.0-2800
    spark.executor.memory            6g
    spark.storage.memoryFraction     0.3
    spark.dynamicAllocation.enabled  true
    spark.shuffle.service.enabled    true
  

RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
It is a bad idea to use a major version change of protobuf, as it most likely won't work.
But if you really want to give it a try, set "user classpath first", so the protobuf 3 coming with your jar will be used.
The setting depends on your deployment mode; check this for the parameter:
https://issues.apache.org/jira/browse/SPARK-2996
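For example (these are the property names I believe exist in Spark 1.3+; treat them as an assumption and verify against your version's configuration docs):

spark-submit --conf spark.driver.userClassPathFirst=true \
             --conf spark.executor.userClassPathFirst=true \
             --jars your_uber_jar ...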
Yong

Subject: Re: Change protobuf version or any other third party library version 
in Spark application
From: ste...@hortonworks.com
To: ljia...@gmail.com
CC: user@spark.apache.org
Date: Tue, 15 Sep 2015 09:19:28 +













On 15 Sep 2015, at 05:47, Lan Jiang  wrote:


Hi, there,



I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default. 
However, I would like to use Protobuf 3 in my spark application so that I can 
use some new features such as Map support.  Is there anyway to do that? 



Right now if I build a uber.jar with dependencies including protobuf 3 classes 
and pass to spark-shell through --jars option, during the execution, I got the 
error java.lang.NoSuchFieldError: unknownFields. 









protobuf is an absolute nightmare version-wise, as protoc generates 
incompatible java classes even across point versions. Hadoop 2.2+ is and will 
always be protobuf 2.5 only; that applies transitively to downstream projects  
(the great protobuf upgrade
 of 2013 was actually pushed by the HBase team, and required a co-ordinated 
change across multiple projects)








Is there anyway to use a different version of Protobuf other than the default 
one included in the Spark distribution? I guess I can generalize and extend the 
question to any third party libraries. How to deal with version conflict for 
any third
 party libraries included in the Spark distribution? 







maven shading is the strategy. Generally it is less needed, though the 
troublesome binaries are,  across the entire apache big data stack:


google protobuf
google guava
kryo

jackson



you can generally bump up the other versions, at least by point releases.   
  

RE: Best way to merge final output part files created by Spark job

2015-09-14 Thread java8964
For text files, this merge works fine, but for binary formats like ORC, Parquet or AVRO, I am not sure it will work.
These kinds of formats are in fact not appendable, as they write their metadata either in the header or in the tail part of the file.
You have to use the format-specific API to merge the data.
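For example, one format-aware way to do the merge (a sketch, assuming the small files are Parquet and that DataFrame.coalesce is available, i.e. Spark 1.4+; the paths are hypothetical):

val df = sqlContext.parquetFile("/path/to/small_parquet_files")
// coalesce controls how many output files get written; 1 gives a single merged file
df.coalesce(1).saveAsParquetFile("/path/to/merged_output")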
Yong

Date: Mon, 14 Sep 2015 09:10:33 +0200
Subject: Re: Best way to merge final output part files created by Spark job
From: gmu...@stratio.com
To: umesh.ka...@gmail.com
CC: user@spark.apache.org

Hi, check out the FileUtil.copyMerge function in the Hadoop API.
It's simple:
1. Get the hadoop configuration from the Spark context:
   FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
2. Create new Path objects for the destination and the source directory.
3. Call copyMerge:
   FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null);
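Put together in Scala, a minimal sketch (the source and destination paths are placeholders):

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
// merges every file under the source directory into the single destination file
val merged = FileUtil.copyMerge(
  fs, new Path("/output/with/many/part-files"),
  fs, new Path("/output/merged/all-parts"),
  true,   // delete the source part files after merging, as in the call above
  conf, null)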
2015-09-13 23:25 GMT+02:00 unk1102 :
Hi I have a spark job which creates around 500 part files inside each

directory I process. So I have thousands of such directories. So I need to

merge these small small 500 part files. I am using

spark.sql.shuffle.partition as 500 and my final small files are ORC files.

Is there a way to merge orc files in Spark if not please suggest the best

way to merge files created by Spark job in hdfs please guide. Thanks much.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org





-- 
Gaspar Muñoz 
@gmunozsoria
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, MadridTel: +34 91 352 59 42 // @stratiobd
  

RE: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread java8964
Or RDD.max() and RDD.min() won't work for you?
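If what is needed is per-column min/max rather than a single value, a minimal sketch on an RDD of arrays (assuming every row has the same number of columns; the sample data is made up):

val rows: org.apache.spark.rdd.RDD[Array[Double]] =
  sc.parallelize(Seq(Array(1.0, 10.0), Array(3.0, 2.0), Array(-5.0, 7.0)))
val mins = rows.reduce((a, b) => a.zip(b).map { case (x, y) => math.min(x, y) })
val maxs = rows.reduce((a, b) => a.zip(b).map { case (x, y) => math.max(x, y) })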
Yong

Subject: Re: Calculating Min and Max Values using Spark Transformations?
To: as...@wso2.com
CC: user@spark.apache.org
From: jfc...@us.ibm.com
Date: Fri, 28 Aug 2015 09:28:43 -0700


If you already loaded csv data into a dataframe, why not register it as a 
table, and use Spark SQL

to find max/min or any other aggregates? SELECT MAX(column_name) FROM 
dftable_name ... seems natural.












JESSE CHEN

Big Data Performance | IBM Analytics



Office:  408 463 2296

Mobile: 408 828 9068

Email:   jfc...@us.ibm.com








ashensw ---08/28/2015 05:40:07 AM---Hi all, I have a dataset which consist of 
large number of features(columns). It is



From:   ashensw as...@wso2.com

To: user@spark.apache.org

Date:   08/28/2015 05:40 AM

Subject:Calculating Min and Max Values using Spark Transformations?








Hi all,



I have a dataset which consist of large number of features(columns). It is

in csv format. So I loaded it into a spark dataframe. Then I converted it

into a JavaRDDRow Then using a spark transformation I converted that into

JavaRDDString[]. Then again converted it into a JavaRDDdouble[]. So now

I have a JavaRDDdouble[]. So is there any method to calculate max and min

values of each columns in this JavaRDDdouble[] ?  



Or Is there any way to access the array if I store max and min values to a

array inside the spark transformation class?



Thanks.







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calculating-Min-and-Max-Values-using-Spark-Transformations-tp24491.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org




  

RE: How to avoid shuffle errors for a large join ?

2015-08-28 Thread java8964
There are several possibilities here.
1) Keep in mind that 7GB of data will need way more than 7G of heap, as deserialized Java objects need much more space than the raw data. A rough rule is to multiply by 6 to 8, so 7G of data needs about 50G of heap space.
2) You should monitor the Spark UI to check how many records are processed by each task, and whether the failed tasks have more data than the rest. Even though some tasks failed, you also have tasks that succeeded. Compare them: do the failed tasks process way more records than the succeeded ones? If so, it indicates you have a data skew problem.
3) If the failed tasks were assigned a similar record count to the succeeded ones, then just add more partitions so that each task processes less data. You should always monitor the GC output in these cases.
4) If most of your tasks failed due to memory, then your settings are too small for your data; add partitions or memory.
A quick way to check per-partition record counts from the shell is sketched below.
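A minimal sketch for spotting skew (assuming rdd is the RDD that feeds the failing stage; for a DataFrame you can use df.rdd):

val sizes = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size))).collect()
sizes.sortBy(-_._2).take(10).foreach(println)   // the ten heaviest partitions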

Yong

From: tom...@gmail.com
Date: Fri, 28 Aug 2015 13:55:52 -0700
Subject: Re: How to avoid shuffle errors for a large join ?
To: ja...@jasonknight.us
CC: user@spark.apache.org

Yeah, I tried with 10k and 30k and these still failed, will try with more then. 
Though that is a little disappointing, it only writes ~7TB of shuffle data 
which shouldn't in theory require more than 1000 reducers on my 10TB memory 
cluster (~7GB of spill per reducer).
I'm now wondering if my shuffle partitions are uneven and I should use a custom 
partitioner, is there a way to get stats on the partition sizes from Spark ?
On Fri, Aug 28, 2015 at 12:46 PM, Jason ja...@jasonknight.us wrote:
I had similar problems to this (reduce side failures for large joins (25bn rows 
with 9bn)), and found the answer was to further up the 
spark.sql.shuffle.partitions=1000. In my case, 16k partitions worked for me, 
but your tables look a little denser, so you may want to go even higher.
On Thu, Aug 27, 2015 at 6:04 PM Thomas Dudziak tom...@gmail.com wrote:
I'm getting errors like Removing executor with no recent heartbeats  
Missing an output location for shuffle errors for a large SparkSql join (1bn 
rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to configure the job 
to avoid them.

The initial stage completes fine with some 30k tasks on a cluster with 70 
machines/10TB memory, generating about 6.5TB of shuffle writes, but then the 
shuffle stage first waits 30min in the scheduling phase according to the UI, 
and then dies with the mentioned errors.

I can see in the GC logs that the executors reach their memory limits (32g per 
executor, 2 workers per machine) and can't allocate any more stuff in the heap. 
Fwiw, the top 10 in the memory use histogram are:

 num     #instances         #bytes  class name
----------------------------------------------
   1:     249139595    11958700560  scala.collection.immutable.HashMap$HashMap1
   2:     251085327     8034730464  scala.Tuple2
   3:     243694737     5848673688  java.lang.Float
   4:     231198778     5548770672  java.lang.Integer
   5:      72191585     4298521576  [Lscala.collection.immutable.HashMap;
   6:      72191582     2310130624  scala.collection.immutable.HashMap$HashTrieMap
   7:      74114058     1778737392  java.lang.Long
   8:       6059103      779203840  [Ljava.lang.Object;
   9:       5461096      174755072  scala.collection.mutable.ArrayBuffer
  10:         34749       70122104  [B
Relevant settings are (Spark 1.4.1, Java 8 with G1 GC):

spark.core.connection.ack.wait.timeout  600
spark.executor.heartbeatInterval        60s
spark.executor.memory                   32g
spark.mesos.coarse                      false
spark.network.timeout                   600s
spark.shuffle.blockTransferService      netty
spark.shuffle.consolidateFiles          true
spark.shuffle.file.buffer               1m
spark.shuffle.io.maxRetries             6
spark.shuffle.manager                   sort
The join is currently configured with spark.sql.shuffle.partitions=1000 but 
that doesn't seem to help. Would increasing the partitions help ? Is there a 
formula to determine an approximate partitions number value for a join ?
Any help with this job would be appreciated !
cheers,Tom


  

RE: query avro hive table in spark sql

2015-08-27 Thread java8964
What version of Hive are you using? And did you compile against the right version of Hive when you compiled Spark?
BTW, spark-avro works great in our experience, but still, some non-technical people just want to use Spark as a SQL shell, like the Hive CLI.
Yong

From: mich...@databricks.com
Date: Wed, 26 Aug 2015 17:48:44 -0700
Subject: Re: query avro hive table in spark sql
To: gpatc...@gmail.com
CC: user@spark.apache.org

I'd suggest looking at http://spark-packages.org/package/databricks/spark-avro
On Wed, Aug 26, 2015 at 11:32 AM, gpatcham gpatc...@gmail.com wrote:
Hi,



I'm trying to query hive table which is based on avro in spark SQL and

seeing below errors.



15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException

determining schema. Returning signal schema to indicate problem

org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither

avro.schema.literal nor avro.schema.url specified, can't determine table

schema

at

org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:68)

at

org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrReturnErrorSchema(AvroSerdeUtils.java:93)

at

org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:60)

at

org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:375)

at

org.apache.hadoop.hive.ql.metadata.Partition.getDeserializer(Partition.java:249)





Its not able to determine schema. Hive table is pointing to avro schema

using url. I'm stuck and couldn't find more info on this.



Any pointers ?







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/query-avro-hive-table-in-spark-sql-tp24462.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org




  

RE: query avro hive table in spark sql

2015-08-27 Thread java8964
You can run a Hive query with spark-avro, but you cannot query a Hive view with spark-avro, as the view is stored in the Hive metadata.
What do you mean by using the "right version" of Spark, after which the "can't determine table schema" problem was fixed? I faced this problem before, and my guess is that a Hive library mismatch caused it, but I am not sure.
I never faced your 2nd problem; can you post the whole stack trace for that error?
Most of our datasets are also in AVRO format.
Yong

Date: Thu, 27 Aug 2015 09:45:45 -0700
Subject: Re: query avro hive table in spark sql
From: gpatc...@gmail.com
To: java8...@hotmail.com
CC: mich...@databricks.com; user@spark.apache.org

can we run hive queries using spark-avro ?
In our case its not just reading the avro file. we have view in hive which is 
based on multiple tables.
On Thu, Aug 27, 2015 at 9:41 AM, Giri P gpatc...@gmail.com wrote:
we are using hive1.1 . 
I was able to fix below error when I used right version spark
15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered 
AvroSerdeExceptiondetermining schema. Returning signal schema to indicate 
problemorg.apache.hadoop.hive.serde2.avro.AvroSerdeException: 
Neitheravro.schema.literal nor avro.schema.url specified, can't determine 
tableschema
atorg.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:68)

atorg.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrReturnErrorSchema(AvroSerdeUtils.java:93)

atorg.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:60)

atorg.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:375)

atorg.apache.hadoop.hive.ql.metadata.Partition.getDeserializer(Partition.java:249)



But I still see this error when querying on some hive avro tables.
15/08/26 17:51:27 WARN
scheduler.TaskSetManager: Lost task 30.0 in stage 0.0 (TID 14,
dtord01hdw0227p.dc.dotomi.net):
org.apache.hadoop.hive.serde2.avro.BadSchemaException

   
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:91)

   
at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:321)

   at
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:320)
I haven't tried spark-avro. We are using Sqlcontext to run queries in our 
application
Any idea if this issue might be coz of querying across different schema version 
of data ?
ThanksGiri
On Thu, Aug 27, 2015 at 5:39 AM, java8964 java8...@hotmail.com wrote:



What version of the Hive you are using? And do you compile to the right version 
of Hive when you compiled Spark?
BTY, spark-avro works great for our experience, but still, some non-tech people 
just want to use as a SQL shell in spark, like HIVE-CLI.
Yong

From: mich...@databricks.com
Date: Wed, 26 Aug 2015 17:48:44 -0700
Subject: Re: query avro hive table in spark sql
To: gpatc...@gmail.com
CC: user@spark.apache.org

I'd suggest looking at http://spark-packages.org/package/databricks/spark-avro
On Wed, Aug 26, 2015 at 11:32 AM, gpatcham gpatc...@gmail.com wrote:
Hi,



I'm trying to query hive table which is based on avro in spark SQL and

seeing below errors.



15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException

determining schema. Returning signal schema to indicate problem

org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither

avro.schema.literal nor avro.schema.url specified, can't determine table

schema

at

org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:68)

at

org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrReturnErrorSchema(AvroSerdeUtils.java:93)

at

org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:60)

at

org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:375)

at

org.apache.hadoop.hive.ql.metadata.Partition.getDeserializer(Partition.java:249)





Its not able to determine schema. Hive table is pointing to avro schema

using url. I'm stuck and couldn't find more info on this.



Any pointers ?







--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/query-avro-hive-table-in-spark-sql-tp24462.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org




  



  

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964
Was your Spark built with Hive?
I met the same problem before because the hive-exec jar in Maven itself includes protobuf classes, which then get included in the Spark jar.
Yong

Date: Tue, 25 Aug 2015 12:39:46 -0700
Subject: Re: Protobuf error when streaming from Kafka
From: lcas...@gmail.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org

Hi,
 I am using Spark-1.4 and Kafka-0.8.2.1
As per google suggestions, I rebuilt all the classes with protobuff-2.5 
dependencies. My new protobuf is compiled using 2.5. However now, my spark job 
does not start. Its throwing different error. Does Spark or any other its 
dependencies uses old protobuff-2.4?

Exception in thread main java.lang.VerifyError: class 
com.apple.ist.retail.xcardmq.serializers.SampleProtobufMessage$ProtoBuff 
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
com.apple.ist.retail.spark.kafka.consumer.SparkMQProcessor.start(SparkProcessor.java:68)
at 
com.apple.ist.retail.spark.kafka.consumer.SparkMQConsumer.main(SparkConsumer.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


On Mon, Aug 24, 2015 at 6:53 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you show the complete stack trace ?
Which Spark / Kafka release are you using ?
Thanks
On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote:
Hi,
 I am storing messages in Kafka using protobuf and reading them into Spark. I 
upgraded protobuf version from 2.4.1 to 2.5.0. I got 
java.lang.UnsupportedOperationException for older messages. However, even for 
new messages I get the same error. Spark does convert it though. I see my 
messages. How do I get rid of this error?
java.lang.UnsupportedOperationException: This is supposed to be overridden by 
subclasses.
at 
com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
at 
org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$FsPermissionProto.getSerializedSize(HdfsProtos.java:5407)
at 
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)




  

SparkSQL problem with IBM BigInsight V3

2015-08-25 Thread java8964
Hi, on our production environment we have a unique problem related to Spark SQL, and I wonder if anyone can give me an idea of the best way to handle it.
Our production Hadoop cluster is IBM BigInsight Version 3, which comes with Hadoop 2.2.0 and Hive 0.12.
Right now we build Spark 1.3.1 ourselves and point to the above versions during the build.
Now, here is the problem: Spark SQL cannot query partitioned Hive tables. It has no problem querying non-partitioned Hive tables.
The error in Spark SQL when querying partitioned Hive tables looks like the following:
javax.jdo.JDODataStoreException: Error executing SQL query "select PARTITIONS.PART_ID from PARTITIONS inner join TBLS on PARTITIONS.TBL_ID = TBLS.TBL_ID inner join DBS on TBLS.DB_ID = DBS.DB_ID where TBLS.TBL_NAME = ? and DBS.NAME = ?".
    at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
    at org.datanucleus.api.jdo.JDOQuery.executeWithArray(JDOQuery.java:321)
    ...
NestedThrowablesStackTrace:
com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-204, SQLSTATE=42704, SQLERRMC=CATALOG.PARTITIONS, DRIVER=4.17.36
The Hive metadata of BigInsight V3 is stored in DB2 (don't ask me why, it is from IBM), and the above error from DB2 simply means TABLE NOT FOUND. If I change the above query to the following:

select PARTITIONS.PART_ID from HIVE.PARTITIONS as PARTITIONS
inner join HIVE.TBLS as TBLS on PARTITIONS.TBL_ID = TBLS.TBL_ID
inner join HIVE.DBS as DBS on TBLS.DB_ID = DBS.DB_ID
where TBLS.TBL_NAME = ? and DBS.NAME = ?
then the query works without any problem. My guess is that IBM changed some part of Hive to make it use DB2 as the underlying database for Hive. DB2 has DB instances, schemas and objects; in fact the tables PARTITIONS, TBLS and DBS all exist in DB2, but under the HIVE schema.
The funny thing is that for unpartitioned tables, Spark SQL works just fine with DB2 as the Hive metadata store.
So my options are:
1) Wait for IBM V4.0, which will include Spark and will presumably make this work, but I don't know when that will happen.
2) Build Spark with the Hive jars provided by IBM BigInsight, assuming those Hive jars work with DB2?
3) Modify some part of the Spark SQL code to make it work with DB2?
My feeling is that option 3 is the best, but I am not sure where to start.
Thanks
Yong
db2 => select schemaname from syscat.schemata
SCHEMANAME
...
HIVE
...

db2 => list tables for schema hive

Table/View           Schema   Type  Creation time
-------------------- -------- ----- --------------------------
BUCKETING_COLS       HIVE     T     2015-08-05-00.09.08.676983
CDS                  HIVE     T     2015-08-05-00.08.38.861789
COLUMNS              HIVE     T     2015-08-05-00.08.56.542476
COLUMNS_V2           HIVE     T     2015-08-05-00.08.36.270223
DATABASE_PARAMS      HIVE     T     2015-08-05-00.08.32.453663
DBS                  HIVE     T     2015-08-05-00.08.29.642279
DB_PRIVS             HIVE     T     2015-08-05-00.08.41.411732
DELEGATION_TOKENS    HIVE     T     2015-08-05-00.41.45.202784
GLOBAL_PRIVS         HIVE     T     2015-08-05-00.08.52.636188
IDXS                 HIVE     T     2015-08-05-00.08.43.117673
INDEX_PARAMS         HIVE     T     2015-08-05-00.08.44.636557
MASTER_KEYS          HIVE     T     2015-08-05-00.41.43.849242
NUCLEUS_TABLES       HIVE     T     2015-08-05-00.09.11.451975
PARTITIONS           HIVE     T     2015-08-05-00.08.45.919837
PARTITION_EVENTS     HIVE     T     2015-08-05-00.08.55.244342
PARTITION_KEYS       HIVE     T     2015-08-05-00.09.01.802570
PARTITION_KEY_VALS   HIVE     T     2015-08-05-00.08.40.103345
PARTITION_PARAMS     HIVE     T     2015-08-05-00.08.53.992383
PART_COL_PRIVS       HIVE     T     2015-08-05-00.09.03.225567
PART_COL_STATS       HIVE     T     2015-08-05-00.41.40.711274
PART_PRIVS           HIVE     T     2015-08-05-00.08.48.542585
ROLES                HIVE     T     2015-08-05-00.08.57.810737
ROLE_MAP             HIVE     T     2015-08-05-00.08.49.984015
SDS                  HIVE     T     2015-08-05-00.09.04.575646
SD_PARAMS            HIVE     T     2015-08-05-00.09.12.710014
SEQUENCE_TABLE       HIVE     T     2015-08-05-00.09.06.135560
SERDES               HIVE     T

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964



I am not familiar with the CDH distribution; we built Spark ourselves.
The error means code generated with Protocol Buffers 2.5.0 is running against a protobuf 2.4.1 (or earlier) jar.
So there is a protobuf 2.4.1 version somewhere, either in the jar you built or in the cluster runtime.
This shows a trick to identify which jar file a class is loaded from:
http://stackoverflow.com/questions/1983839/determine-which-jar-file-a-class-is-from
You may want to add a log line at the start of your code to check the class com.google.protobuf.GeneratedMessage, see which jar file it is loaded from, and verify whether it is the 2.5 version or below.
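A minimal sketch of that check (getCodeSource can return null for classes on the boot classpath):

val src = classOf[com.google.protobuf.GeneratedMessage].getProtectionDomain.getCodeSource
println(if (src == null) "loaded from the boot classpath" else src.getLocation.toString)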

Yong
Date: Tue, 25 Aug 2015 13:44:17 -0700
Subject: Re: Protobuf error when streaming from Kafka
From: lcas...@gmail.com
To: java8...@hotmail.com
CC: yuzhih...@gmail.com; user@spark.apache.org

Do you think this binary would have issue? Do I need to build spark from source 
code?

On Tue, Aug 25, 2015 at 1:06 PM, Cassa L lcas...@gmail.com wrote:
I downloaded below binary version of spark.
spark-1.4.1-bin-cdh4

On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote:



Did your spark build with Hive?
I met the same problem before because the hive-exec jar in the maven itself 
include protobuf class, which will be included in the Spark jar.
Yong

Date: Tue, 25 Aug 2015 12:39:46 -0700
Subject: Re: Protobuf error when streaming from Kafka
From: lcas...@gmail.com
To: yuzhih...@gmail.com
CC: user@spark.apache.org

Hi,
 I am using Spark-1.4 and Kafka-0.8.2.1
As per google suggestions, I rebuilt all the classes with protobuff-2.5 
dependencies. My new protobuf is compiled using 2.5. However now, my spark job 
does not start. Its throwing different error. Does Spark or any other its 
dependencies uses old protobuff-2.4?

Exception in thread main java.lang.VerifyError: class 
com.apple.ist.retail.xcardmq.serializers.SampleProtobufMessage$ProtoBuff 
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
com.apple.ist.retail.spark.kafka.consumer.SparkMQProcessor.start(SparkProcessor.java:68)
at 
com.apple.ist.retail.spark.kafka.consumer.SparkMQConsumer.main(SparkConsumer.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


On Mon, Aug 24, 2015 at 6:53 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you show the complete stack trace ?
Which Spark / Kafka release are you using ?
Thanks
On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote:
Hi,
 I am storing messages in Kafka using protobuf and reading them into Spark. I 
upgraded protobuf version from 2.4.1 to 2.5.0. I got 
java.lang.UnsupportedOperationException for older messages. However, even for 
new messages I get the same error. Spark does convert it though. I see my 
messages. How do I get rid of this error?
java.lang.UnsupportedOperationException: This is supposed to be overridden by 
subclasses.
at 
com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
at 
org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$FsPermissionProto.getSerializedSize(HdfsProtos.java:5407)
at 
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)




  




  

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
I believe spark-shell -i scriptFile is there. We also use it, at least in Spark 1.3.1.
dse spark just wraps the spark-shell command; underneath it is simply invoking spark-shell.
I don't know too much about the original problem though.
Yong
Yong
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re: Transformation not happening for reduceByKey or GroupByKey
From: zjf...@gmail.com
To: jsatishchan...@gmail.com
CC: robin.e...@xense.co.uk; user@spark.apache.org

Hi Satish,
I don't see where Spark supports -i, so I suspect it is provided by DSE. In that case, it might be a bug in DSE.


On Fri, Aug 21, 2015 at 6:02 PM, satish chandra j jsatishchan...@gmail.com 
wrote:
HI Robin,
Yes, it is DSE, but the issue is related to Spark only.
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East robin.e...@xense.co.uk wrote:
Not sure, never used dse - it’s part of DataStax Enterprise right?
On 21 Aug 2015, at 10:07, satish chandra j jsatishchan...@gmail.com wrote:
HI Robin,
Yes, the below-mentioned piece of code works fine in the Spark shell, but when the same code is placed in a 
script file and executed with -i <file name>, it creates an empty RDD.

scala> val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[77] at makeRDD at console:28

scala> pairs.reduceByKey((x,y) => x + y).collect
res43: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
Command:
dse spark --master local --jars postgresql-9.4-1201.jar -i  ScriptFile

I understand I am missing something here, due to which my final RDD does not have the required output.
Regards,
Satish Chandra
On Thu, Aug 20, 2015 at 8:23 PM, Robin East robin.e...@xense.co.uk wrote:
This works for me:
scala> val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[77] at makeRDD at console:28

scala> pairs.reduceByKey((x,y) => x + y).collect
res43: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
On 20 Aug 2015, at 11:05, satish chandra j jsatishchan...@gmail.com wrote:
HI All,
I have data in an RDD as mentioned below:
RDD: Array[(Int, Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))

I am expecting output as Array((0,3),(1,50),(2,40)), just a sum function on the values for each key.
Code:
RDD.reduceByKey((x,y) => x+y)
RDD.take(3)
Result in console:
RDD: org.apache.spark.rdd.RDD[(Int,Int)] = ShuffledRDD[1] at reduceByKey at console:73
res: Array[(Int,Int)] = Array()
Command as mentioned

dse spark --master local --jars postgresql-9.4-1201.jar -i  ScriptFile

Please let me know what is missing in my code, as my resultant Array is empty



Regards,Satish









-- 
Best Regards

Jeff Zhang
  

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
What version of Spark are you using, or which one comes with DSE 4.7?
We just cannot reproduce it in Spark.
yzhang@localhost$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect

yzhang@localhost$ ~/spark/bin/spark-shell --master local -i test.spark
Welcome to Spark version 1.3.1 (ASCII banner trimmed)
Using Scala version 2.10.4
Spark context available as sc.
SQL context available as sqlContext.
Loading test.spark...
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at makeRDD at console:21
15/08/21 09:58:51 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
res0: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))
Yong

Date: Fri, 21 Aug 2015 19:24:09 +0530
Subject: Re: Transformation not happening for reduceByKey or GroupByKey
From: jsatishchan...@gmail.com
To: abhis...@tetrationanalytics.com
CC: user@spark.apache.org

HI Abhishek,
I have even tried that but rdd2 is empty
Regards,Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh 
abhis...@tetrationanalytics.com wrote:
You had:



 RDD.reduceByKey((x,y) => x+y)

 RDD.take(3)



Maybe try:



 rdd2 = RDD.reduceByKey((x,y) => x+y)

 rdd2.take(3)



-Abhishek-



On Aug 20, 2015, at 3:05 AM, satish chandra j jsatishchan...@gmail.com wrote:



 HI All,

 I have data in RDD as mentioned below:



 RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))





 I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on 
 Values for each key



 Code:

 RDD.reduceByKey((x,y) => x+y)

 RDD.take(3)



 Result in console:

 RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at 
 console:73

 res:Array[(Int,Int)] = Array()



 Command as mentioned



 dse spark --master local --jars postgresql-9.4-1201.jar -i  ScriptFile





 Please let me know what is missing in my code, as my resultant Array is empty







 Regards,

 Satish






  

How frequently should full gc we expect

2015-08-21 Thread java8964
In a test job I am running on Spark 1.3.1 in our stage cluster, I can see the following information on the 
application's stage page:

Metric      Min     25th percentile   Median    75th percentile   Max
Duration    0 ms    1.1 min           1.5 min   1.7 min           3.4 min
GC Time     11 s    16 s              21 s      25 s              54 s

From the GC output log, I can see there is a full GC in the executor roughly every minute, like below.
My question is: the committed heap is more than 14G (with -XX:MaxPermSize=128m), and in this case the max heap 
usage is about 10G, so why does a full GC happen every minute?
The job runs fine; I just want to know what expectation you guys normally have for full GC frequency in your 
Spark jobs.
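For reference, the executor JVM options that produce GC output in this format are roughly the following (a 
sketch; -XX:+PrintGCDateStamps is my assumption, added to match the absolute timestamps in the log below):

    spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps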
Thanks
Yong
2015-08-21T16:53:59.561-0400: [Full GC [PSYoungGen: 328038K-0K(3728384K)] 
[ParOldGen: 10359817K-5856671K(11185152K)] 10687855K-5856671K(14913536K) 
[PSPermGen: 57214K-57214K(57856K)], 8.6951450 secs] [Times: user=140.72 
sys=0.18, real=8.69 secs] 
2015-08-21T16:54:09.605-0400: [GC [PSYoungGen: 1864192K-251539K(3728384K)] 
7720863K-6108211K(14913536K), 0.1217750 secs] [Times: user=2.12 sys=0.01, 
real=0.12 secs] 
2015-08-21T16:54:11.131-0400: [GC [PSYoungGen: 2115731K-163448K(3728384K)] 
7972404K-6197142K(14913536K), 0.1802910 secs] [Times: user=3.19 sys=0.01, 
real=0.18 secs] 
2015-08-21T16:54:12.832-0400: [GC [PSYoungGen: 2027640K-144369K(3728384K)] 
8061339K-6314232K(14913536K), 0.1816010 secs] [Times: user=3.03 sys=0.00, 
real=0.19 secs] 
2015-08-21T16:54:14.547-0400: [GC [PSYoungGen: 2008561K-121478K(3728384K)] 
8178424K-6435609K(14913536K), 0.1411160 secs] [Times: user=2.50 sys=0.00, 
real=0.14 secs] 
2015-08-21T16:54:15.931-0400: [GC [PSYoungGen: 1985670K-114489K(3728384K)] 
8299801K-6550508K(14913536K), 0.1285300 secs] [Times: user=2.13 sys=0.00, 
real=0.13 secs] 
2015-08-21T16:54:17.323-0400: [GC [PSYoungGen: 1978681K-219811K(3792896K)] 
8414700K-6769504K(14978048K), 0.1649230 secs] [Times: user=2.89 sys=0.01, 
real=0.17 secs] 
2015-08-21T16:54:18.878-0400: [GC [PSYoungGen: 2148515K-425173K(3728384K)] 
8698218K-6974876K(14913536K), 0.3130360 secs] [Times: user=5.56 sys=0.00, 
real=0.31 secs] 
2015-08-21T16:54:20.596-0400: [GC [PSYoungGen: 2353877K-313071K(3985408K)] 
8903582K-7071556K(15170560K), 0.2423240 secs] [Times: user=4.30 sys=0.00, 
real=0.24 secs] 
2015-08-21T16:54:22.695-0400: [GC [PSYoungGen: 2608367K-371370K(3902464K)] 
9366852K-7338548K(15087616K), 0.2647510 secs] [Times: user=4.48 sys=0.00, 
real=0.26 secs] 
2015-08-21T16:54:24.747-0400: [GC [PSYoungGen: 266K-459392K(4174336K)] 
9633844K-7528652K(15359488K), 0.3564370 secs] [Times: user=6.36 sys=0.00, 
real=0.35 secs] 
2015-08-21T16:54:26.951-0400: [GC [PSYoungGen: 3116160K-445880K(4075008K)] 
10185420K-7746897K(15260160K), 0.2853880 secs] [Times: user=5.07 sys=0.00, 
real=0.29 secs] 
2015-08-21T16:54:29.340-0400: [GC [PSYoungGen: 3102648K-286176K(4314112K)] 
10403665K-7809242K(15499264K), 0.2534940 secs] [Times: user=4.48 sys=0.01, 
real=0.25 secs] 
2015-08-21T16:54:31.979-0400: [GC [PSYoungGen: 3269600K-122064K(4261888K)] 
10792666K-7863493K(15447040K), 0.2035800 secs] [Times: user=3.41 sys=0.00, 
real=0.20 secs] 
2015-08-21T16:54:34.737-0400: [GC [PSYoungGen: 3105488K-555850K(4373504K)] 
10846917K-8297279K(15558656K), 0.2401510 secs] [Times: user=4.14 sys=0.00, 
real=0.24 secs] 
2015-08-21T16:54:38.015-0400: [GC [PSYoungGen: 3675978K-1146062K(4266496K)] 
11417409K-8887493K(15451648K), 0.4298600 secs] [Times: user=7.65 sys=0.00, 
real=0.43 secs] 
2015-08-21T16:54:41.492-0400: [GC [PSYoungGen: 4266190K-1326063K(3565056K)] 
12007627K-9231644K(14750208K), 0.5542100 secs] [Times: user=9.90 sys=0.01, 
real=0.55 secs] 
2015-08-21T16:54:43.797-0400: [GC [PSYoungGen: 3565039K-1587981K(3827200K)] 
11470620K-9612725K(15012352K), 0.5359080 secs] [Times: user=9.57 sys=0.00, 
real=0.54 secs] 
2015-08-21T16:54:45.856-0400: [GC [PSYoungGen: 3826957K-1047737K(3629568K)] 
11851701K-9914434K(14814720K), 0.7787060 secs] [Times: user=13.91 sys=0.00, 
real=0.78 secs] 
2015-08-21T16:54:48.174-0400: [GC [PSYoungGen: 2911929K-459808K(3728384K)] 
11778626K-10058483K(14913536K), 0.5953360 secs] [Times: user=10.62 sys=0.03, 
real=0.60 secs] 
2015-08-21T16:54:50.217-0400: [GC [PSYoungGen: 2324000K-102928K(3740160K)] 
11922675K-10159967K(14925312K), 0.3191560 secs] [Times: user=5.68 sys=0.01, 
real=0.32 secs] 
2015-08-21T16:54:51.951-0400: [GC [PSYoungGen: 1978896K-296227K(3728384K)] 
12035935K-10456136K(14913536K), 0.1809970 secs] [Times: user=3.02 sys=0.00, 
real=0.18 secs] 
2015-08-21T16:54:53.550-0400: [GC [PSYoungGen: 2172195K-316636K(3866624K)] 
12332104K-10720591K(15051776K), 0.2545970 secs] [Times: user=4.43 sys=0.00, 
real=0.25 secs] 
2015-08-21T16:54:55.340-0400: [GC [PSYoungGen: 2390748K-336907K(3800064K)] 
12794703K-11043658K(14985216K), 0.3550330 secs] [Times: user=6.28 sys=0.00, 
real=0.35 secs] 
2015-08-21T16:54:55.695-0400: [Full GC [PSYoungGen: 336907K-0K(3800064K)] 
[ParOldGen: 10706750K-5725402K(11185152K)] 11043658K-5725402K(14985216K) 
[PSPermGen: 57214K-57214K(57856K)], 9.5623960 secs] [Times: user=150.15 

RE: Any suggestion about sendMessageReliably failed because ack was not received within 120 sec

2015-08-20 Thread java8964
The closest information I could find online related to this error is 
https://issues.apache.org/jira/browse/SPARK-3633
But it is quite different in our case. In our case, we never saw the (Too many open files) error; the log simply 
shows the 120 sec timeout.
I checked the GC output from all 42 executors; the max full GC of real=11.79 secs is all I can find, way less 
than the 120-second timeout.
Out of the 42 executors, there is one whose stdout/stderr page hangs; I cannot see any GC or log information for 
that executor, but it is shown as LOADING on the master page. I think the reason is just that the WorkerUI could 
not bind to 8081 somehow during boot time and bound to 8082 instead, and the master UI didn't catch that 
information.
Anyway, my only option now is to increase the timeout of both 
spark.core.connection.ack.wait.timeout and spark.akka.timeout to 600, as 
suggested in the jira, and will report back what I find later.
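For example, a sketch of how I plan to pass them (appended to the same spark-shell command we already use; the 
value 600 is just the number suggested in the jira):

    --conf spark.core.connection.ack.wait.timeout=600 --conf spark.akka.timeout=600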
This same daily job runs about 12 hours in Hive/MR, and can finish in about 4 hours in Spark (with 25% of the 
cluster resources allocated). On this point Spark is faster and great, but only IF (big IF) every task runs 
smoothly.
In Hive/MR, once the job is set up, it will finish, maybe slowly, but smoothly. In Spark, in this case, it does 
retry only the failed partitions, but we sometimes saw 4 or 5 rounds of retries, which in fact makes it much, 
much slower.
Yong
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Any suggestion about sendMessageReliably failed because ack was not 
received within 120 sec
Date: Thu, 20 Aug 2015 20:49:52 -0400




Hi, Sparkers:
After the first 2 weeks of Spark in our production cluster, and with more familiarity with Spark, we are more 
confident about avoiding lost executors due to memory issues. So far, most of our jobs won't fail or slow down 
due to a lost executor.
But sometimes, I observed that individual tasks failed due to sendMessageReliably failed because ack was not 
received within 120 sec. 
Here is the basic information:
Spark 1.3.1 with 1 master + 42 worker boxes in a standalone deployment.
The cluster also runs Hadoop + MapReduce, so we allocate about 25% of the resources to Spark. We are conservative 
with the Spark jobs, using a low number of cores + big parallelism/partitions to control the memory usage in the 
job; so far we have managed to avoid lost executors.
We have one big daily job is running with following configuration:
/opt/spark/bin/spark-shell --jars spark-avro.jar --conf spark.ui.port=4042 
--executor-memory 20G --total-executor-cores 168 --conf 
spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=6000 
--conf spark.default.parallelism=6000 --conf 
spark.shuffle.blockTransferService=nio -i spark.script
168 cores will make each executor run with 4 threads (168 / 42 = 4).
There is no cache needed, so I make the storage memoryFraction very low.
nio is much more robust than netty in our experience.
For this big daily job generating over 2 of tasks, they all could finish 
without this issue, but sometimes, for the same job, tasks keep failing due to 
this error and retry.
But even in this case, I saw tasks fail due to this error and retry. Retries may be part of life in a distributed 
environment, but I want to know what root cause could be behind them and how to avoid them.
Should I increase spark.core.connection.ack.wait.timeout to fix this error? When this happened, I saw no executor 
was lost; all were alive. 
Below is the message in the log, for example, it complained about timeout to 
connect to host-121.
FetchFailed(BlockManagerId(31, host-121, 38930), shuffleId=3, mapId=17, 
reduceId=2577, message=org.apache.spark.shuffle.FetchFailedException: 
sendMessageReliably failed because ack was not received within 120 sec  at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
  at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
  at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)  at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)  
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)  
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)  at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)  at 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)  at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)  at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
  at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)  at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)  at 

Any suggestion about sendMessageReliably failed because ack was not received within 120 sec

2015-08-20 Thread java8964
Hi, Sparkers:
After the first 2 weeks of Spark in our production cluster, and with more familiarity with Spark, we are more 
confident about avoiding lost executors due to memory issues. So far, most of our jobs won't fail or slow down 
due to a lost executor.
But sometimes, I observed that individual tasks failed due to sendMessageReliably failed because ack was not 
received within 120 sec. 
Here is the basic information:
Spark 1.3.1 with 1 master + 42 worker boxes in a standalone deployment.
The cluster also runs Hadoop + MapReduce, so we allocate about 25% of the resources to Spark. We are conservative 
with the Spark jobs, using a low number of cores + big parallelism/partitions to control the memory usage in the 
job; so far we have managed to avoid lost executors.
We have one big daily job is running with following configuration:
/opt/spark/bin/spark-shell --jars spark-avro.jar --conf spark.ui.port=4042 
--executor-memory 20G --total-executor-cores 168 --conf 
spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=6000 
--conf spark.default.parallelism=6000 --conf 
spark.shuffle.blockTransferService=nio -i spark.script
168 cores will make each executor run with 4 threads (168 / 42 = 4).
There is no cache needed, so I make the storage memoryFraction very low.
nio is much more robust than netty in our experience.
For this big daily job generating over 2 of tasks, they all could finish 
without this issue, but sometimes, for the same job, tasks keep failing due to 
this error and retry.
But even in this case, I saw tasks fail due to this error and retry. Retries may be part of life in a distributed 
environment, but I want to know what root cause could be behind them and how to avoid them.
Should I increase spark.core.connection.ack.wait.timeout to fix this error? When this happened, I saw no executor 
was lost; all were alive. 
Below is the message in the log, for example, it complained about timeout to 
connect to host-121.
FetchFailed(BlockManagerId(31, host-121, 38930), shuffleId=3, mapId=17, 
reduceId=2577, message=org.apache.spark.shuffle.FetchFailedException: 
sendMessageReliably failed because ack was not received within 120 sec  at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
  at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
  at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)  at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)  
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)  
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)  at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)  at 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)  at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)  at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
  at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)  at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)  at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)  at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)  at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)  at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)  at 
org.apache.spark.scheduler.Task.run(Task.scala:64)  at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
 at java.lang.Thread.run(Thread.java:745)Caused by: java.io.IOException: 
sendMessageReliably failed because ack was not received within 120 sec  at 
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:929)
  at 
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:928)
  at scala.Option.foreach(Option.scala:236)  at 
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:928)
  at 
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
  at 
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
  at 

RE: Failed to fetch block error

2015-08-19 Thread java8964
From the log, it looks like the OS user who is running Spark cannot open any more files.
Check your ulimit setting for that user:
ulimit -a
open files                      (-n) 65536
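If that limit is too low, a sketch of raising it on a typical Linux box (assuming the worker runs as a user named 
spark; adjust to your setup) is to add the following to /etc/security/limits.conf and restart the worker process 
so the new limit takes effect:

    spark  soft  nofile  65536
    spark  hard  nofile  65536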

 Date: Tue, 18 Aug 2015 22:06:04 -0700
 From: swethakasire...@gmail.com
 To: user@spark.apache.org
 Subject: Failed to fetch block  error
 
 Hi,
 
 I see the following error in my Spark Job even after using like 100 cores
 and 16G memory. Did any of you experience the same problem earlier?
 
 15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block
 input-0-1439959114400, and will not retry (0 retries)
 java.lang.RuntimeException: java.io.FileNotFoundException:
 /data1/spark/spark-aed30958-2ee1-4eb7-984e-6402fb0a0503/blockmgr-ded36b52-ccc7-48dc-ba05-65bb21fc4136/34/input-0-1439959114400
 (Too many open files)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.init(RandomAccessFile.java:241)
   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
   at 
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
   at
 org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:302)
   at
 org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
   at
 org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
   at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Failed-to-fetch-block-error-tp24335.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

RE: Spark Job Hangs on our production cluster

2015-08-18 Thread java8964
Hi, Imran:
Thanks for your reply. I am not sure what you mean by repl. Can you give more detail about that?
This only happens when Spark 1.2.2 tries to scan the big dataset, and I cannot reproduce it when it scans the 
smaller dataset.
FYI, I have built and deployed Spark 1.3.1 on our production cluster. Right now, I cannot reproduce this hang 
problem on the same cluster for the same big dataset. On this point, we will continue trying Spark 1.3.1, and 
hope we will have a more positive experience with it.
But just wondering, what class would Spark need to load at this point? From my understanding, the executor has 
already scanned the first block of data from HDFS and hangs while starting the 2nd block. All the classes should 
already be loaded in the JVM in this case.
Thanks
Yong
From: iras...@cloudera.com
Date: Tue, 18 Aug 2015 12:17:56 -0500
Subject: Re: Spark Job Hangs on our production cluster
To: java8...@hotmail.com
CC: user@spark.apache.org

just looking at the thread dump from your original email, the 3 executor 
threads are all trying to load classes.  (One thread is actually loading some 
class, and the others are blocked waiting to load a class, most likely trying 
to load the same thing.)  That is really weird, definitely not something which 
should keep things blocked for 30 min.  It suggests something wrong w/ the jvm, 
or classpath configuration, or a combination.  Looks like you are trying to run 
in the repl, and for whatever reason the http server for the repl to serve 
classes is not responsive.  I'd try running outside of the repl and see if that 
works.
sorry not a full diagnosis but maybe this'll help a bit.
On Tue, Aug 11, 2015 at 3:19 PM, java8964 java8...@hotmail.com wrote:



Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42 
data/task nodes, which runs with BigInsight V3.0.0.2, corresponding with Hadoop 
2.2.0 with MR1.
Since IBM BigInsight doesn't come with Spark, so we build Spark 1.2.2 with 
Hadoop 2.2.0 + Hive 0.12 by ourselves, and deploy it on the same cluster.
The IBM Biginsight comes with IBM jdk 1.7, but during our experience on stage 
environment, we found out Spark works better with Oracle JVM. So we run spark 
under Oracle JDK 1.7.0_79.
Now on production, we are facing an issue we never faced before, nor can we reproduce it on our staging cluster. 
We are using Spark Standalone cluster, and here is the basic configurations:
more spark-env.sh
export JAVA_HOME=/opt/java
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_CONF_DIR=/opt/ibm/biginsights/hadoop-conf/
export SPARK_CLASSPATH=/opt/ibm/biginsights/IHC/lib/ibm-compression.jar:/opt/ibm/biginsights/hive/lib/db2jcc4-10.6.jar
export SPARK_LOCAL_DIRS=/data1/spark/local,/data2/spark/local,/data3/spark/local
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_MASTER_IP=host1
export SPARK_MASTER_OPTS=-Dspark.deploy.defaultCores=42
export SPARK_WORKER_MEMORY=24g
export SPARK_WORKER_CORES=6
export SPARK_WORKER_DIR=/tmp/spark/work
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g

more spark-defaults.conf
spark.master                     spark://host1:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://host1:9000/spark/eventLog
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
We are using AVRO file format a lot, and we have these 2 datasets, one is about 
96G, and the other one is a little over 1T. Since we are using AVRO, so we also 
built spark-avro of commit a788c9fce51b0ec1bb4ce88dc65c1d55aaa675b8, which is 
the latest version supporting Spark 1.2.x.
Here is the problem we are facing on our production cluster, even the following 
simple spark-shell commands will hang in our production cluster:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val bigData = sqlContext.avroFile("hdfs://namenode:9000/bigData/")
bigData.registerTempTable("bigData")
bigData.count()

From the console, we saw the following:
[Stage 0:    (44 + 42) / 7800]
no update for more than 30 minutes and longer.
The big dataset with 1T should generate 7800 HDFS blocks, but Spark's HDFS client looks like it has a problem 
reading them. Since we are running Spark on the data nodes, all the Spark tasks are running at the NODE_LOCAL 
locality level.
If I go to the data/task node which Spark tasks hang, and use the JStack to 
dump the thread, I got the following on the top:
015-08-11 15:38:38Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.79-b02 
mixed mode):
Attach Listener daemon prio=10 tid=0x7f0660589000 nid=0x1584d waiting on 
condition [0x]   java.lang.Thread.State: RUNNABLE
org.apache.hadoop.hdfs.PeerCache@4a88ec00 daemon prio=10 
tid=0x7f06508b7800 nid=0x13302 waiting on condition [0x7f060be94000]   
java.lang.Thread.State: TIMED_WAITING

Spark Job Hangs on our production cluster

2015-08-17 Thread java8964
I am comparing the Spark logs line by line between the hanging case (big dataset) and the non-hanging case 
(small dataset).
In the hanging case, Spark's log looks identical to the non-hanging case while reading the first block of data 
from HDFS.
But after that, starting from line 438 in spark-hang.log, I only see log output generated from the Worker, like 
the following, for the next 10 minutes:
15/08/14 14:24:19 DEBUG Worker: [actor] received message SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]15/08/14 14:24:19 DEBUG Worker: 
[actor] handled message (0.121965 ms) SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]...15/08/14
 14:33:04 DEBUG Worker: [actor] received message SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]15/08/14 14:33:04 DEBUG Worker: 
[actor] handled message (0.136146 ms) SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]
until, after almost 10 minutes, I have to kill the job. I know it will hang forever.
But in the good log (spark-finished.log), starting from line 361, Spark started to read the 2nd split of data, 
and I can see all the debug messages from BlockReaderLocal and BlockManager.
If I compare the logs of these 2 cases: in the good-log case, from line 478, I can see this message:
15/08/14 14:37:09 DEBUG BlockReaderLocal: putting FileInputStream for ..
But in the hang-log case, for reading the 2nd split of data, I don't see this message any more (it existed for 
the 1st split). I believe this log message should show up in this case too, as the 2nd split block also exists 
on this Spark node; just before it, I can see the following debug message:
15/08/14 14:24:11 DEBUG BlockReaderLocal: Created BlockReaderLocal for file 
/services/contact2/data/contacts/20150814004805-part-r-2.avro block 
BP-834217708-10.20.95.130-1438701195738:blk_1074484553_1099531839081 in 
datanode 10.20.95.146:5001015/08/14 14:24:11 DEBUG Project: Creating 
MutableProj: WrappedArray(), inputSchema: ArrayBuffer(account_id#0L, 
contact_id#1, sequence_id#2, state#3, name#4, kind#5, prefix_name#6, 
first_name#7, middle_name#8, company_name#9, job_title#10, source_name#11, 
source_details#12, provider_name#13, provider_details#14, created_at#15L, 
create_source#16, updated_at#17L, update_source#18, accessed_at#19L, 
deleted_at#20L, delta#21, birthday_day#22, birthday_month#23, anniversary#24L, 
contact_fields#25, related_contacts#26, contact_channels#27, 
contact_notes#28, contact_service_addresses#29, contact_street_addresses#30), 
codegen:false
This log is generated on node (10.20.95.146), and Spark created 
BlockReaderLocal to read the data from the local node.
Now my question is, can someone give me any idea why DEBUG BlockReaderLocal: 
putting FileInputStream for  doesn't show up any more in this case?
I attached the log files again in this email, and really hope I can get some 
help from this list.
Thanks
Yong
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: RE: Spark Job Hangs on our production cluster
Date: Fri, 14 Aug 2015 15:14:10 -0400




I still want to check whether anyone can provide any help related to this: Spark 1.2.2 hangs on our production 
cluster when reading big HDFS data (7800 avro blocks), while it looks fine for small data (769 avro blocks).
I enabled the debug level in the Spark log4j and attached the log file, in case it helps troubleshooting this 
case.
Summary of our cluster:
IBM BigInsight V3.0.0.2 (running with Hadoop 2.2.0 + Hive 0.12)
42 data nodes, each one running an HDFS data node process + task tracker + spark worker
One master, running the HDFS Name node + Spark master
Another master node, running the 2nd Name node + JobTracker
I did 2 test cases, using a very simple spark shell script to read 2 folders: one is big data with 1T of avro 
files; the other is small data with 160G of avro files.
The avro schemas of the 2 folders are different, but I don't think that will make any difference here.
The test script is like the following:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val testdata = sqlContext.avroFile("hdfs://namenode:9000/bigdata_folder")   // vs sqlContext.avroFile("hdfs://namenode:9000/smalldata_folder")
testdata.registerTempTable("testdata")
testdata.count()
Both cases are kicked off the same way:
/opt/spark/bin/spark-shell --jars /opt/ibm/cclib/spark-avro.jar --conf spark.ui.port=4042 --executor-memory 24G --total-executor-cores 42 --conf spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000
When the script points to the small data folder, Spark can finish very fast; each task scanning an HDFS block 
finishes within 30 seconds or less.
When the script points to the big data folder, most of the nodes can finish scanning the first HDFS block within 
2 mins (longer than case 1), then the scanning will 

RE: Spark Job Hangs on our production cluster

2015-08-14 Thread java8964
I still want to check whether anyone can provide any help related to this: Spark 1.2.2 hangs on our production 
cluster when reading big HDFS data (7800 avro blocks), while it looks fine for small data (769 avro blocks).
I enabled the debug level in the Spark log4j and attached the log file, in case it helps troubleshooting this 
case.
Summary of our cluster:
IBM BigInsight V3.0.0.2 (running with Hadoop 2.2.0 + Hive 0.12)
42 data nodes, each one running an HDFS data node process + task tracker + spark worker
One master, running the HDFS Name node + Spark master
Another master node, running the 2nd Name node + JobTracker
I did 2 test cases, using a very simple spark shell script to read 2 folders: one is big data with 1T of avro 
files; the other is small data with 160G of avro files.
The avro schemas of the 2 folders are different, but I don't think that will make any difference here.
The test script is like the following:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val testdata = sqlContext.avroFile("hdfs://namenode:9000/bigdata_folder")   // vs sqlContext.avroFile("hdfs://namenode:9000/smalldata_folder")
testdata.registerTempTable("testdata")
testdata.count()
Both cases are kicked off the same way:
/opt/spark/bin/spark-shell --jars /opt/ibm/cclib/spark-avro.jar --conf spark.ui.port=4042 --executor-memory 24G --total-executor-cores 42 --conf spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000
When the script points to the small data folder, Spark can finish very fast; each task scanning an HDFS block 
finishes within 30 seconds or less.
When the script points to the big data folder, most of the nodes can finish scanning the first HDFS block within 
2 mins (longer than case 1), then the scanning hangs, across all the nodes in the cluster, which means no task 
can continue any more. The whole job will hang until I have to kill it.
There are logs attached in this email, and here is what I can read from the log 
files:
1) Spark-finished.log, which is the log generated from Spark in the good case.
In this case, it is clear there is a loop to read the data from HDFS, looping like:
15/08/14 14:38:05 INFO HadoopRDD: Input split:
15/08/14 14:37:40 DEBUG Client: IPC Client (370155726) connection to p2-bigin101/10.20.95.130:9000 from
15/08/14 14:37:40 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2ms
15/08/14 14:38:32 INFO HadoopRDD: Input split:
 There are exception in it, like: java.lang.NoSuchMethodException: 
org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()at 
java.lang.Class.getDeclaredMethod(Class.java:2009)at 
org.apache.spark.util.Utils$.invoke(Utils.scala:1827)at 
org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:179)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:179)
But doesn't affect the function and didn't fail the job.
2) Spark-hang.log, which is from the same node generated from Spark in the hang 
case:In this case, it looks like Spark can read the data from HDFS first 
time, as the log looked same as the good case log., but after that, only the 
following DEBUG message output: 15/08/14 14:24:19 DEBUG Worker: [actor] 
received message SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]15/08/14 14:24:19 DEBUG Worker: 
[actor] handled message (0.121965 ms) SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]15/08/14 14:24:34 DEBUG Worker: 
[actor] received message SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]15/08/14 14:24:34 DEBUG Worker: 
[actor] handled message (0.135455 ms) SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]15/08/14 14:24:49 DEBUG Worker: 
[actor] received message SendHeartbeat from 
Actor[akka://sparkWorker/user/Worker#90699948]
There are no more connecting-to-datanode messages, and after 10 minutes I have to just kill the executor.
During these 10 minutes, I ran jstack twice on the Spark java process, trying to find out which thread is being 
blocked; the dumps are attached as 2698306-1.log and 2698306-2.log (2698306 is the pid).
Can someone give me any hint about what could be the root cause of this? While Spark is hanging on reading the 
big dataset, HDFS is healthy, as I can get/put data in HDFS, and the MR jobs running at the same time continue 
without any problems.
I am thinking of generating a 1T folder of text files to test Spark in this cluster, as I want to rule out any 
problem that could be related to AVRO, but it will take a while for me to generate that. And I am not sure the 
AVRO format could really be the cause.
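If I go that route, a rough sketch of generating the test text data from spark-shell could be the following (the 
record count, record width and HDFS path are all made-up numbers to tune toward the target size):

    // ~100 bytes per record; increase records/partitions to approach the 1T target
    val records = 100000000
    sc.parallelize(1 to records, 2000)
      .map(i => i + "," + ("x" * 100))
      .saveAsTextFile("hdfs://namenode:9000/testTextData")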
Thanks for your help.
Yong
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark Job Hangs on our production cluster
Date: 

Spark 1.2.2 build problem with Hive 0.12, bringing in wrong version of avro-mapred

2015-08-12 Thread java8964
Hi, This email is sent to both dev and user list, just want to see if someone 
familiar with Spark/Maven build procedure can provide any help.
I am building Spark 1.2.2 with the following command:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0
The spark-assembly-1.2.2-hadoop2.2.0.jar contains avro and avro-ipc of version 1.7.6, but avro-mapred of version 
1.7.1, which caused some weird runtime exceptions when I tried to read an avro file in Spark 1.2.2, such as the 
following:
NullPointerException
    at java.io.StringReader.<init>(StringReader.java:50)
    at org.apache.avro.Schema$Parser.parse(Schema.java:943)
    at org.apache.avro.Schema.parse(Schema.java:992)
    at org.apache.avro.mapred.AvroJob.getInputSchema(AvroJob.java:65)
    at org.apache.avro.mapred.AvroRecordReader.<init>(AvroRecordReader.java:43)
    at org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:52)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
So I run the following command to understand that avro-mapred 1.7.1 is brought 
in by Hive 0.12 profile:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0 dependency:tree 
-Dverbose -Dincludes=org.apache.avro
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Hive 1.2.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.2
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.2.2:compile
[INFO] |  \- org.apache.hadoop:hadoop-client:jar:2.2.0:compile (version managed from 1.0.4)
[INFO] |     \- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO] |        \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] +- org.spark-project.hive:hive-serde:jar:0.12.0-protobuf-2.5:compile
[INFO] |  +- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.1:compile
[INFO] |     \- (org.apache.avro:avro-ipc:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO] +- org.apache.avro:avro:jar:1.7.6:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]    +- org.apache.avro:avro-ipc:jar:1.7.6:compile
[INFO]    |  \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO]    \- org.apache.avro:avro-ipc:jar:tests:1.7.6:compile
[INFO]       \- (org.apache.avro:avro:jar:1.7.6:compile - version managed from 1.7.1; omitted for duplicate)
[INFO]
In this case, I could manually fix all the classes in the final jar, changing avro-mapred from 1.7.1 to 1.7.6, 
but I wonder if there is any other solution, as this way is very error-prone.
Also, just from the above output, I can see the avro-mapred:jar:hadoop2:1.7.6 dependency is there, but it looks 
like it is being omitted. I am not sure why Maven chose the lower version, as I am not a Maven guru.
My question: in the above situation, is there an easy way to build it with avro-mapred 1.7.6 instead of 1.7.1?
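One direction I am considering (just a sketch, not verified against the Spark build, so treat it as an 
assumption): add an exclusion for the transitive avro-mapred to the existing hive-serde dependency in 
sql/hive/pom.xml, so that only the explicit hadoop2 1.7.6 artifact shown above stays on the classpath:

    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-serde</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-mapred</artifactId>
        </exclusion>
      </exclusions>
    </dependency>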
Thanks
Yong  

Spark Job Hangs on our production cluster

2015-08-11 Thread java8964
Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42 
data/task nodes, which runs with BigInsight V3.0.0.2, corresponding with Hadoop 
2.2.0 with MR1.
Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with Hadoop 2.2.0 + Hive 0.12 ourselves, and 
deployed it on the same cluster.
IBM BigInsight comes with IBM JDK 1.7, but from our experience in the stage environment, we found that Spark 
works better with the Oracle JVM. So we run Spark under Oracle JDK 1.7.0_79.
Now on production, we are facing an issue we never faced before, nor can we reproduce it on our staging cluster. 
We are using Spark Standalone cluster, and here is the basic configurations:
more spark-env.sh
export JAVA_HOME=/opt/java
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_CONF_DIR=/opt/ibm/biginsights/hadoop-conf/
export SPARK_CLASSPATH=/opt/ibm/biginsights/IHC/lib/ibm-compression.jar:/opt/ibm/biginsights/hive/lib/db2jcc4-10.6.jar
export SPARK_LOCAL_DIRS=/data1/spark/local,/data2/spark/local,/data3/spark/local
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_MASTER_IP=host1
export SPARK_MASTER_OPTS=-Dspark.deploy.defaultCores=42
export SPARK_WORKER_MEMORY=24g
export SPARK_WORKER_CORES=6
export SPARK_WORKER_DIR=/tmp/spark/work
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g

more spark-defaults.conf
spark.master                     spark://host1:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://host1:9000/spark/eventLog
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
We are using AVRO file format a lot, and we have these 2 datasets, one is about 
96G, and the other one is a little over 1T. Since we are using AVRO, so we also 
built spark-avro of commit a788c9fce51b0ec1bb4ce88dc65c1d55aaa675b8, which is 
the latest version supporting Spark 1.2.x.
Here is the problem we are facing on our production cluster, even the following 
simple spark-shell commands will hang in our production cluster:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val bigData = sqlContext.avroFile("hdfs://namenode:9000/bigData/")
bigData.registerTempTable("bigData")
bigData.count()

From the console, we saw the following:
[Stage 0:    (44 + 42) / 7800]
no update for more than 30 minutes and longer.
The big dataset with 1T should generate 7800 HDFS blocks, but Spark's HDFS client looks like it has a problem 
reading them. Since we are running Spark on the data nodes, all the Spark tasks are running at the NODE_LOCAL 
locality level.
If I go to the data/task node which Spark tasks hang, and use the JStack to 
dump the thread, I got the following on the top:
015-08-11 15:38:38Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.79-b02 
mixed mode):
Attach Listener daemon prio=10 tid=0x7f0660589000 nid=0x1584d waiting on 
condition [0x]   java.lang.Thread.State: RUNNABLE
org.apache.hadoop.hdfs.PeerCache@4a88ec00 daemon prio=10 
tid=0x7f06508b7800 nid=0x13302 waiting on condition [0x7f060be94000]   
java.lang.Thread.State: TIMED_WAITING (sleeping)at 
java.lang.Thread.sleep(Native Method)at 
org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:252)at 
org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:39)at 
org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:135)at 
java.lang.Thread.run(Thread.java:745)
shuffle-client-1 daemon prio=10 tid=0x7f0650687000 nid=0x132fc runnable 
[0x7f060d198000]   java.lang.Thread.State: RUNNABLEat 
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)at 
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)at 
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)- locked 
0x00067bf47710 (a io.netty.channel.nio.SelectedSelectionKeySet)- 
locked 0x00067bf374e8 (a java.util.Collections$UnmodifiableSet)- 
locked 0x00067bf373d0 (a sun.nio.ch.EPollSelectorImpl)at 
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)at 
io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:622)at 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:310)at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Meantime, I can confirm our Hadoop/HDFS cluster works fine, as the MapReduce 
jobs also run without any problem, and Hadoop fs command works fine in the 
BigInsight.
I attached the jstack output with this email, but I don't know what could be 
the root reason.The same Spark shell command works fine, if I point to the 
small dataset, instead of big dataset. The small dataset 

Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production 
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files, about 3T. Our business has a complex query running on this dataset, 
which is stored in a nested structure with Array of Struct in Avro and Hive.
We can query it using Hive without any problem, but we like SparkSQL's performance, so we in fact run the same 
query in Spark SQL, and found out it is in fact much faster than Hive.
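(For context, the way we kick the query off is roughly the following sketch; the table name and the toy query 
here are placeholders, not our real ones:

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.sql("SELECT account_id, count(*) FROM contacts_avro GROUP BY account_id").collect()

The real query is the much bigger HiveQL statement whose plan appears in the stack trace below.)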
But when we run it, we get the following error randomly from the Spark executors, sometimes seriously enough to 
fail the whole Spark job.
Below is the stack trace, and I think it is a bug related to Spark, because:
1) The error shows up inconsistently, as sometimes we won't see it for this job (we run it daily).
2) Sometimes it won't fail our job, as it recovers after a retry.
3) Sometimes it will fail our job, as I listed below.
4) Is this due to the multithreading in Spark? The NullPointerException indicates Hive got a null ObjectInspector 
among the children of a StructObjectInspector, as I read the Hive source code, but I know there is no null 
ObjectInspector among the children of the StructObjectInspector. Googling this error didn't give me any hint. 
Does anyone know anything like this?
Project 
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
 StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL 
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, 
StringType),CAST(active_contact_count_a#16L, 
StringType),CAST(other_api_contact_count_a#6L, 
StringType),CAST(fb_api_contact_count_a#7L, 
StringType),CAST(evm_contact_count_a#8L, 
StringType),CAST(loyalty_contact_count_a#9L, 
StringType),CAST(mobile_jmml_contact_count_a#10L, 
StringType),CAST(savelocal_contact_count_a#11L, 
StringType),CAST(siteowner_contact_count_a#12L, 
StringType),CAST(socialcamp_service_contact_count_a#13L, 
S...org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 
(TID 257, 10.20.95.146): java.lang.NullPointerExceptionat 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.init(AvroObjectInspectorGenerator.java:55)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)  
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)  
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
   at org.apache.spark.scheduler.Task.run(Task.scala:56)at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
   at 

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Michael:
I am not sure how spark-avro can help in this case. 
My understanding is that to use Spark-avro, I have to translate all the logic 
from this big Hive query into Spark code, right?
If I have this big Hive query, how can I use spark-avro to run the query?
Thanks
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

Have you considered trying Spark SQL's native support for avro data?
https://github.com/databricks/spark-avro

On Fri, Aug 7, 2015 at 11:30 AM, java8964 java8...@hotmail.com wrote:



Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production 
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files about 3T. Our business has a complex 
query running for the dataset, which is stored in nest structure with Array of 
Struct in Avro and Hive.
We can query it using Hive without any problem, but we like the SparkSQL's 
performance, so we in fact run the same query in the Spark SQL, and found out 
it is in fact much faster than Hive.
But when we run it, we got the following error randomly from Spark executors, 
sometime seriously enough to fail the whole spark job.
Below the stack trace, and I think it is a bug related to Spark due to:
1) The error jumps out inconsistent, as sometimes we won't see it for this job. 
(We run it daily)2) Sometime it won't fail our job, as it recover after 
retry.3) Sometime it will fail our job, as I listed below.4) Is this due to the 
multithreading in Spark? The NullPointException indicates Hive got a Null 
ObjectInspector of the children of StructObjectInspector, as I read the Hive 
source code, but I know there is no null of ObjectInsepector as children of 
StructObjectInspector. Google this error didn't give me any hint. Does any one 
know anything like this?
Project 
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
 StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL 
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, 
StringType),CAST(active_contact_count_a#16L, 
StringType),CAST(other_api_contact_count_a#6L, 
StringType),CAST(fb_api_contact_count_a#7L, 
StringType),CAST(evm_contact_count_a#8L, 
StringType),CAST(loyalty_contact_count_a#9L, 
StringType),CAST(mobile_jmml_contact_count_a#10L, 
StringType),CAST(savelocal_contact_count_a#11L, 
StringType),CAST(siteowner_contact_count_a#12L, 
StringType),CAST(socialcamp_service_contact_count_a#13L, 
S...org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 
(TID 257, 10.20.95.146): java.lang.NullPointerExceptionat 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.init(AvroObjectInspectorGenerator.java:55)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)  
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at 
org.apache.spark.rdd.MapPartitionsRDD.compute

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Good to know that.
Let me research it and give it a try.
Thanks
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:44:48 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

You can register your data as a table using this library and then query it 
using HiveQL
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")
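and then, for example, query it from the same context (a minimal sketch; episodes is just the sample data 
shipped with the library's tests):

    sqlContext.sql("SELECT * FROM episodes LIMIT 10").collect()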
On Fri, Aug 7, 2015 at 11:42 AM, java8964 java8...@hotmail.com wrote:



Hi, Michael:
I am not sure how spark-avro can help in this case. 
My understanding is that to use Spark-avro, I have to translate all the logic 
from this big Hive query into Spark code, right?
If I have this big Hive query, how can I use spark-avro to run the query?
Thanks
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

Have you considered trying Spark SQL's native support for avro data?
https://github.com/databricks/spark-avro

On Fri, Aug 7, 2015 at 11:30 AM, java8964 java8...@hotmail.com wrote:



Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production 
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files, about 3T. Our business has a complex query running against this dataset, which is stored in a nested structure with Array of Struct in Avro and Hive.
We can query it using Hive without any problem, but we like Spark SQL's performance, so we in fact run the same query in Spark SQL, and found it is in fact much faster than Hive.
But when we run it, we get the following error randomly from Spark executors, sometimes seriously enough to fail the whole Spark job.
Below is the stack trace, and I think it is a bug related to Spark because:
1) The error jumps out inconsistently, as sometimes we won't see it for this job (we run it daily).
2) Sometimes it won't fail our job, as it recovers after retry.
3) Sometimes it will fail our job, as I listed below.
4) Is this due to the multithreading in Spark? The NullPointerException indicates Hive got a null ObjectInspector for the children of StructObjectInspector, as I read the Hive source code, but I know there is no null ObjectInspector as a child of StructObjectInspector. Googling this error didn't give me any hint. Does anyone know anything like this?
Project 
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
 StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL 
tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, 
StringType),CAST(active_contact_count_a#16L, 
StringType),CAST(other_api_contact_count_a#6L, 
StringType),CAST(fb_api_contact_count_a#7L, 
StringType),CAST(evm_contact_count_a#8L, 
StringType),CAST(loyalty_contact_count_a#9L, 
StringType),CAST(mobile_jmml_contact_count_a#10L, 
StringType),CAST(savelocal_contact_count_a#11L, 
StringType),CAST(siteowner_contact_count_a#12L, 
StringType),CAST(socialcamp_service_contact_count_a#13L, 
S...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.init(AvroObjectInspectorGenerator.java:55)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247

RE: Use rank with distribute by in HiveContext

2015-07-16 Thread java8964
Yes, the Hive UDF and distribute by are both supported by Spark SQL.
If you are using Spark 1.4, you can try the Hive analytics window functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics); most of them are already supported in Spark 1.4, so you don't need the custom rank UDF.
Yong
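For reference, a minimal sketch of the window-function syntax that replaces a custom rank UDF. It assumes a HiveContext named hiveContext and an illustrative products table; nothing here comes from the original question:

hiveContext.sql("""
  SELECT category, product, price,
         rank() OVER (PARTITION BY category ORDER BY price DESC) AS rnk
  FROM products
""").show()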
Date: Thu, 16 Jul 2015 15:10:58 +0300
Subject: Use rank with distribute by in HiveContext
From: lio...@taboola.com
To: user@spark.apache.org

Does spark HiveContext support the rank() ... distribute by syntax (as in the 
following article- 
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive 
)?
If not, how can it be achieved?
Thanks,Lior   

RE: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread java8964
Like you, there are lots of people coming from the MapReduce world trying to understand the internals of Spark. I hope the below helps you in some way.
End users really only have the concept of a job: "I want to run a word count over this one big file" — that is the job. How many stages and tasks the job generates depends on the file size and the parallelism you specify in your job.
For word count, it will generate 2 stages, because there is a shuffle in it; think of them the same way as the Mapper and Reducer parts.
If the file is 1280M in HDFS with a 128M block size, the first stage will generate 10 tasks. If you use the default parallelism in Spark, the 2nd stage should generate 200 tasks.
Forget about executors for now: the above job has 210 tasks to run. In standalone mode, you need to specify the cores and memory for your job. Let's assume you have 5 worker nodes with 4 cores + 8G each. Now, if you ask for 10 cores and 2G per executor, and the cluster has enough resources available, then you will get 1 executor from each worker node, with 2 cores + 2G per executor, to run your job. In this case, the first 10 tasks of stage one can start concurrently; after that, every 10 tasks of stage 2 can run concurrently. You get 5 executors because you have 5 worker nodes. There is a coming feature to start multiple executors per worker, but we are talking about the normal case here. In fact, you can start multiple workers on one physical box, if you have enough resources.
In the above case, 2 tasks run concurrently per executor. You control this by specifying how many cores you want for your job, plus how many workers are pre-configured in your cluster. These 2 tasks have to share the 2G heap memory. I don't think specifying memory per task is a good idea, because a task runs at the thread level, and memory applies only to the JVM process.
In MR, every mapper and reducer maps to a Java process, but in Spark a task just maps to a thread/core.
In Spark, memory tuning is more of an art, but there are still lots of rules to follow. In the above case, you can increase the parallelism to 400, so you will have 400 tasks in stage 2 and each task will come with less data, provided you have a large enough number of unique words in the file. Or you can lower the cores from 10 to 5, so each executor will only process one task at a time, but your job will run slower.
Overall, you want to max out the parallelism to gain the best speed, but also make sure the memory is enough for your job at that speed, to avoid OOM. It is a balance.
Keep in mind:
- The cluster is pre-configured with a number of workers, total cores, and the max heap memory you can ask for.
- Per application, you specify the total cores you want plus the heap memory per executor.
- In your application, you can specify the parallelism level, as lots of actions support it. So parallelism is dynamic, from job to job, or even from stage to stage.
A short word-count sketch follows below to make the stage/task mapping concrete.
Yong
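A minimal word-count sketch; the paths are illustrative and the 400 matches the parallelism figure used above:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

// Stage 1: roughly one task per HDFS block of the input file
val pairs = sc.textFile("hdfs:///data/big_file.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// Stage 2: the shuffle side; the second argument sets the number of reduce tasks
val counts = pairs.reduceByKey(_ + _, 400)

counts.saveAsTextFile("hdfs:///data/word_counts")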
Date: Wed, 27 May 2015 15:48:57 +0800
Subject: Re: How does spark manage the memory of executor with multiple tasks
From: ccn...@gmail.com
To: evo.efti...@isecc.com
CC: ar...@sigmoidanalytics.com; user@spark.apache.org

Does anyone can answer my question ? I am curious to know if there's multiple 
reducer tasks in one executor, how to allocate memory between these reducers 
tasks since each shuffle will consume a lot of memory ?
On Tue, May 26, 2015 at 7:27 PM, Evo Eftimov evo.efti...@isecc.com wrote:
 the link you sent says multiple executors per node
Worker is just a daemon process launching Executors / JVMs so it can execute tasks - it does that by cooperating with the master and the driver.
There is a one-to-one mapping between Executor and JVM.

Sent from Samsung Mobile

 Original message From: Arush Kharbanda  Date:2015/05/26  10:55 
 (GMT+00:00) To: canan chen  Cc: Evo Eftimov ,user@spark.apache.org Subject: 
Re: How does spark manage the memory of executor with multiple tasks 
Hi Evo,
Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you 
would be able to run multiple executors on the same JVM/worker.
https://issues.apache.org/jira/browse/SPARK-1706.

ThanksArush
On Tue, May 26, 2015 at 2:54 PM, canan chen ccn...@gmail.com wrote:
I think the concept of a task in Spark should be on the same level as a task in MR. Usually in MR, we need to specify the memory for each mapper/reducer task. And I believe executor is not a user-facing concept; it's a Spark internal concept. Spark users don't need to know the concept of executor, but they do need to know the concept of task. 
On Tue, May 26, 2015 at 5:09 PM, Evo Eftimov evo.efti...@isecc.com wrote:
This is the first time I hear that “one can specify the RAM per task” – the RAM 
is granted per Executor (JVM). On the other hand each Task operates on ONE RDD 
Partition – so you can say that this is “the RAM allocated to the Task to 
process” – but it is still within the boundaries allocated to the Executor 

RE: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-06 Thread java8964
It looks like you have data in these 24 partitions, or more. How many unique names are in your data set?
Enlarging the shuffle partitions only makes sense if you have large partition groups in your data. What you described looks like either your dataset only has data in these 24 partitions, or you have skewed data in these 24 partitions.
If you are really joining 56M of data with 26M of data, I am surprised that you would have 24 partitions running very slowly under an 8G executor.
Yong
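A quick way to check for the skew described above is to count rows per join key and look at the largest groups. The table and column names follow the thread's db/sample/name naming, so treat them as illustrative:

sqlContext.sql("""
  SELECT name, COUNT(*) AS cnt
  FROM sample
  GROUP BY name
  ORDER BY cnt DESC
  LIMIT 20
""").collect().foreach(println)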
Date: Wed, 6 May 2015 14:04:11 +0800
From: luohui20...@sina.com
To: luohui20...@sina.com; hao.ch...@intel.com; daoyuan.w...@intel.com; 
ssab...@gmail.com; user@spark.apache.org
Subject: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

Update on the status after I did some tests: I modified some other parameters and found 2 parameters that may be relevant: spark_worker_instance and spark.sql.shuffle.partitions.


Before today I used the default settings of spark_worker_instance and spark.sql.shuffle.partitions, whose values are 1 and 200. At that time, my app stopped running at 5/200 tasks.


Then I changed spark_worker_instance to 2, and my app moved on to about 116/200 tasks; after changing spark_worker_instance to 4, I got further progress, to 176/200. However, when I changed it to 8 or even more, like 12 workers, it is still stuck at 176/200.


Later, new findings came while trying different spark.sql.shuffle.partitions values. If I change it to 50, 400, or 800 partitions, it stops at 26/50, 376/400, 776/800 tasks, always leaving 24 tasks unable to finish.


Not sure why this happens. Hope this info could be helpful to solve it.




 
Thanks & Best regards!
罗辉 San.Luo

- Original Message -
From: luohui20...@sina.com
To: Cheng, Hao hao.ch...@intel.com, Wang, Daoyuan daoyuan.w...@intel.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-06 09:51

db has 1.7 million records while sample has 0.6 million. For JVM settings, I tried the defaults and also tried to apply 4g by exporting _java_opts, but the app still stops running.
BTW, here is some detailed info about GC and the JVM.
- Original Message -
From: Cheng, Hao hao.ch...@intel.com
To: luohui20...@sina.com luohui20...@sina.com, Wang, Daoyuan daoyuan.w...@intel.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 20:50
56MB / 26MB is a very small size. Do you observe data skew? More precisely, many records with the same chrname / name? And can you also double-check the JVM settings for the executor process?
 
 
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.
 
Hi guys,
  attached are the pic of the physical plan and the logs. Thanks.

 
Thanks & Best regards!
罗辉 San.Luo
 
- Original Message -
From: Cheng, Hao hao.ch...@intel.com
To: Wang, Daoyuan daoyuan.w...@intel.com, luohui20...@sina.com luohui20...@sina.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 13:18
 
I assume you’re using the DataFrame API within your application.
 
sql(“SELECT…”).explain(true)
 
From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
 
You can use
Explain extended select ….
 
From:
luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
 
As far as I know, broadcast join is automatically enabled by spark.sql.autoBroadcastJoinThreshold; refer to
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
 
And how can I check my app's physical plan, and other things like the optimized plan, the executable plan, etc.?
 
thanks
 

 
Thanks & Best regards!
罗辉 San.Luo
 
- Original Message -
From: Cheng, Hao hao.ch...@intel.com
To: Cheng, Hao hao.ch...@intel.com, luohui20...@sina.com luohui20...@sina.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 08:38
 
Or, have you ever tried a broadcast join?
 
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: sparksql running slow while joining 2 tables.
 
Can you print out the physical plan?
 
EXPLAIN SELECT xxx…
 
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: sparksql running slow while joining 2 tables.
 
hi Olivier
spark1.3.1, with 

RE: Expert advise needed. (POC is at crossroads)

2015-04-30 Thread java8964
Really not an expert here, but try the following ideas:
1) I assume you are using YARN; then this blog is very good about resource tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
2) If 12G is a hard limit in this case, then you have no option but to lower your concurrency. Try setting --executor-cores=1 as a first step; this will force each executor to run one task at a time. This is the least efficient setting for your job, but see whether your application can finish without OOM.
3) Add more partitions to your RDDs. For a given RDD, more partitions means each partition contains less data, which requires less memory to process; if each one is processed by 1 core in each executor, you almost lower your executor memory requirement to the lowest level.
4) Do you cache data? Don't cache it for now, and lower spark.storage.memoryFraction, so less memory is reserved for the cache.
Since your top priority is to avoid OOM, all the above steps will make the job run slower, or less efficiently. In any case, you should first check your code logic to see if there is any possible improvement, but we assume your code is already optimized, as in your email. If the above steps still cannot help with your OOM, then maybe the data for one partition just cannot fit in a 12G heap, given the logic you are trying to do in your code. (A spark-submit sketch of these settings follows below.)
Yong
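For concreteness, a spark-submit sketch of the conservative settings suggested above; every value here is illustrative, as are the class and jar names, and this is not a recommendation for this particular job:

spark-submit --master yarn-cluster --class com.example.POCJob \
  --num-executors 96 \
  --executor-memory 12g \
  --executor-cores 1 \
  --conf spark.storage.memoryFraction=0.2 \
  --conf spark.default.parallelism=2000 \
  poc-assembly.jar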
From: deepuj...@gmail.com
Date: Thu, 30 Apr 2015 18:48:12 +0530
Subject: Expert advise needed. (POC is at crossroads)
To: user@spark.apache.org

I am at crossroads now and expert advise help me decide what the next course of 
the project going to be.
Background : At out company we process tons of data to help build 
experimentation platform. We fire more than 300s of M/R jobs, Peta bytes of 
data, takes 24 hours and does lots of joins. Its simply stupendously complex. 
POC: Migrate a small portion of processing to Spark and aim to achieve 10x 
gains. Today this processing on M/R world takes 2.5 to 3 Hours. 
Data Sources: 3 (All on HDFS). Format: Two in Sequence File and one in AvroData 
Size:1)  64 files  169,380,175,136 bytes- Sequence







2) 101 files84,957,259,664 bytes- Avro3) 744 files   
1,972,781,123,924 bytes- Sequence
ProcessA) Map Side Join of #1 and #2B) Left Outer Join of A) and #3C) Reduce By 
Key of B)D) Map Only processing of C.
Optimizations1) Converted Equi-Join to Map-Side  (Broadcast variables ) Join 
#A.2) Converted groupBy + Map = ReduceBy Key #C.
I have a huge YARN (Hadoop 2.4.x) cluster at my disposal but I am limited to 
use only 12G on each node.
1) My poc (after a month of crazy research, lots of QA on this amazing forum) 
runs fine with 1 file each from above data sets and finishes in 10 mins taking 
4 executors. I started with 60 mins and got it down to 10 mins.2) For 5 files 
each data set it takes 45 mins and 16 executors.3) When i run against 10 files, 
it fails repeatedly with OOM and several timeout errors.Configs:  
--num-executors 96 --driver-memory 12g --driver-java-options 
-XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4, Spark 1.3.1









Expert AdviceMy goal is simple to be able to complete the processing at 10x to 
100x speed than M/R or show its not possible with Spark.
A) 10x to 100x1) What will it take in terms of # of executors, # of 
executor-cores ?  amount of memory on each executor and some unknown magic 
settings that am suppose to do to reach this goal ?2) I am attaching the code 
for review that can further speed up processing, if at all its possible ?3) Do 
i need to do something else ?
B) Give up and wait for next amazing tech to come upGiven the steps that i have 
performed so far, should i conclude that its not possible to achieve 10x to 
100x gains and am stuck with M/R world for now.
I am in need of help here. I am available for discussion at any time 
(day/night).
Hope i provided all the details.Regards,
Deepak



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org 
  

RE: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-13 Thread java8964
If it is really due to data skew, wouldn't the hanging task have a much bigger shuffle write size in this case?
In this case, the shuffle write size for that task is 0, and the rest of the IO for this task is not much larger than that of the quickly finished tasks. Is that normal?
I am also interested in this case: from the statistics on the UI, how does it indicate that the task could have skewed data?
Yong 

Date: Mon, 13 Apr 2015 12:58:12 -0400
Subject: Re: Equi Join is taking for ever. 1 Task is Running while other 199 
are complete
From: jcove...@gmail.com
To: deepuj...@gmail.com
CC: user@spark.apache.org

I can promise you that this is also a problem in the pig world :) not sure why 
it's not a problem for this data set, though... are you sure that the two are 
doing the exact same code?
you should inspect your source data. Make a histogram for each and see what the 
data distribution looks like. If there is a value or bucket with a 
disproportionate set of values you know you have an issue
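A sketch of that histogram check against the pair RDDs used in the join shown further down in this thread (lstgItem is one of the poster's RDDs keyed by itemId; repeat the same for viEvents):

// Count records per join key, then print the 20 largest counts
lstgItem.mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .map { case (itemId, cnt) => (cnt, itemId) }
  .top(20)
  .foreach(println)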
2015-04-13 12:50 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
You mean there is a tuple in either RDD, that has itemID = 0 or null ? And what 
is catch all ?
That implies is it a good idea to run a filter on each RDD first ? We do not do 
this using Pig on M/R. Is it required in Spark world ?
On Mon, Apr 13, 2015 at 9:58 PM, Jonathan Coveney jcove...@gmail.com wrote:
My guess would be data skew. Do you know if there is some item id that is a 
catch all? can it be null? item id 0? lots of data sets have this sort of value 
and it always kills joins
2015-04-13 11:32 GMT-04:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
Code:
val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] =
  lstgItem.join(viEvents).map { case (itemId, (listing, viDetail)) =>
    val viSummary = new VISummary
    viSummary.leafCategoryId = listing.getLeafCategId().toInt
    viSummary.itemSiteId = listing.getItemSiteId().toInt
    viSummary.auctionTypeCode = listing.getAuctTypeCode().toInt
    viSummary.sellerCountryId = listing.getSlrCntryId().toInt
    viSummary.buyerSegment = 0
    viSummary.isBin = (if (listing.getBinPriceLstgCurncy.doubleValue() > 0) 1 else 0)
    val sellerId = listing.getSlrId.toLong
    (sellerId, (viDetail, viSummary, itemId))
  }
Running Tasks:

Index | ID  | Attempt | Status  | Locality Level | Executor ID / Host                          | Launch Time         | Duration | GC Time | Shuffle Read Size / Records | Write Time | Shuffle Write Size / Records | Shuffle Spill (Memory) | Shuffle Spill (Disk)
0     | 216 | 0       | RUNNING | PROCESS_LOCAL  | 181 / phxaishdc9dn0474.phx.ebay.com         | 2015/04/13 06:43:53 | 1.7 h    | 13 min  | 3.0 GB / 56964921           |            | 0.0 B / 0                    | 21.2 GB                | 1902.6 MB
2     | 218 | 0       | SUCCESS | PROCESS_LOCAL  | 582 / phxaishdc9dn0235.phx.ebay.com         | 2015/04/13 06:43:53 | 15 min   | 31 s    | 2.2 GB / 1666851            | 0.1 s      | 3.0 MB / 2062                | 54.8 GB                | 1924.5 MB
1     | 217 | 0       | SUCCESS | PROCESS_LOCAL  | 202 / phxdpehdc9dn2683.stratus.phx.ebay.com | 2015/04/13 06:43:53 | 19 min   | 1.3 min | 2.2 GB / 1687086            | 75 ms      | 3.9 MB / 2692                | 33.7 GB                | 1960.4 MB
4     | 220 | 0       | SUCCESS | PROCESS_LOCAL  | 218 / phxaishdc9dn0855.phx.ebay.com         | 2015/04/13 06:43:53 | 15 min   | 56 s    | 2.2 GB / 1675654            | 40 ms      | 3.3 MB / 2260                | 26.2 GB                | 1928.4 MB

Command:
./bin/spark-submit -v --master yarn-cluster --driver-class-path 
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar
 --jars 
/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/spark_reporting_dep_only-1.0-SNAPSHOT.jar
  

RE: EC2 spark-submit --executor-memory

2015-04-08 Thread java8964
If you are using the Spark Standalone deployment, make sure you set SPARK_WORKER_MEMORY to over 20G, and that you actually have over 20G of physical memory.
Yong
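A sketch of the corresponding spark-env.sh entry on each standalone worker; the 24g is an illustrative value chosen to leave headroom above a 20G executor:

# conf/spark-env.sh on every worker
export SPARK_WORKER_MEMORY=24g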

 Date: Tue, 7 Apr 2015 20:58:42 -0700
 From: li...@adobe.com
 To: user@spark.apache.org
 Subject: EC2 spark-submit --executor-memory
 
 Dear Spark team,
 
 I'm using the EC2 script to startup a Spark cluster. If I login and use the
 executor-memory parameter in the submit script, the UI tells me that no
 cores are assigned to the job and nothing happens. Without executor-memory
 everything works fine... Until I get dag-scheduler-event-loop
 java.lang.OutOfMemoryError: Java heap space, but that's another issue.
 
 ./bin/spark-submit \
   --class ... \
   --executor-memory 20G \
   /path/to/examples.jar 
 
 Thanks.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/EC2-spark-submit-executor-memory-tp22417.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

RE: Reading file with Unicode characters

2015-04-08 Thread java8964
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop pretty much only supports Linux, UTF-8 is the only supported encoding, as it is the default one on Linux.
If you have data in another encoding, you may want to vote for this JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-232
Yong
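Until that JIRA is addressed, a common workaround is to read the raw Text records and decode them with the right charset yourself; a sketch, with ISO-8859-1 as an illustrative encoding:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read the raw bytes of each line, then decode with an explicit charset
val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/file.txt")
  .map { case (_, text) => new String(text.getBytes, 0, text.getLength, "ISO-8859-1") }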

Date: Wed, 8 Apr 2015 10:35:18 -0700
Subject: Reading file with Unicode characters
From: lists.a...@gmail.com
To: user@spark.apache.org
CC: lists.a...@gmail.com

Hi,
Does SparkContext's textFile() method handle files with Unicode characters? How 
about files in UTF-8 format?
Going further, is it possible to specify encodings to the method? If not, what 
should one do if the files to be read are in some encoding?
Thanks,arun
  

RE: Incremently load big RDD file into Memory

2015-04-07 Thread java8964
cartesian is an expensive operation. If you have M records in locations, then locations.cartesian(locations) will generate M x M results. If locations is a big RDD, it is hard to do locations.cartesian(locations) efficiently.
Yong
 Date: Tue, 7 Apr 2015 10:04:12 -0700
 From: mas.ha...@gmail.com
 To: user@spark.apache.org
 Subject: Incremently load big RDD file into Memory
 
 
 val locations = filelines.map(line => line.split("\t"))
   .map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble)))
   .distinct()
   .collect()
 
 val cartesienProduct = locations.cartesian(locations).map(t =>
   Edge(t._1._1, t._2._1,
     distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))
 
 Code executes perfectly fine uptill here but when i try to use
 cartesienProduct it got stuck i.e.
 
 val count =cartesienProduct.count()
 
 Any help to efficiently do this will be highly appreciated.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

RE: 'Java heap space' error occured when query 4G data file from HDFS

2015-04-07 Thread java8964
It is hard to guess why the OOM happens without knowing your application's logic and the data size.
Without knowing that, I can only guess based on some common experience:
1) Increase spark.default.parallelism.
2) Increase your executor-memory; maybe 6g is just not enough.
3) Your environment is kind of unbalanced between CPU cores and available memory (8 cores vs 12G). Each core should have about 3G for Spark.
4) If you cache an RDD, use MEMORY_ONLY_SER instead of MEMORY_ONLY.
5) Since you have many cores compared with your available memory, lower the cores per executor by setting -Dspark.deploy.defaultCores=. When you don't have enough memory, reducing the concurrency of your executor will lower the memory requirement, at the cost of running at a slower speed.
A sketch of these adjustments follows below.
Yong
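A spark-submit sketch of suggestions 1, 2 and 5 above for this yarn-client job; the concrete numbers are illustrative guesses, not a tested configuration (on YARN, --executor-cores is what controls the per-executor concurrency):

spark-submit --master yarn-client \
  --driver-memory 7g \
  --executor-memory 8g \
  --executor-cores 2 \
  --conf spark.default.parallelism=300 \
  /home/hadoop/spark/main.py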

Date: Wed, 8 Apr 2015 04:57:22 +0800
Subject: Re: 'Java heap space' error occured when query 4G data file from HDFS
From: lidali...@gmail.com
To: user@spark.apache.org

Any help?please.
Help me do a right configure.

李铖 lidali...@gmail.com于2015年4月7日星期二写道:
In my dev-test env I have 3 virtual machines; every machine has 12G memory and 8 CPU cores.
Here are spark-defaults.conf and spark-env.sh. Maybe some config is not right.
I run this command: spark-submit --master yarn-client --driver-memory 7g --executor-memory 6g /home/hadoop/spark/main.py, and the exception below was raised.
spark-defaults.conf:
spark.master                        spark://cloud1:7077
spark.default.parallelism           100
spark.eventLog.enabled              true
spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.driver.memory                 5g
spark.driver.maxResultSize          6g
spark.kryoserializer.buffer.mb      256
spark.kryoserializer.buffer.max.mb  512
spark.executor.memory               4g
spark.rdd.compress                  true
spark.storage.memoryFraction        0
spark.akka.frameSize                50
spark.shuffle.compress              true
spark.shuffle.spill.compress        false
spark.local.dir                     /home/hadoop/tmp

spark-env.sh:
export SCALA=/home/hadoop/softsetup/scala
export JAVA_HOME=/home/hadoop/softsetup/jdk1.7.0_71
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=4g
export HADOOP_CONF_DIR=/opt/cloud/hadoop/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4g
export SPARK_DRIVER_MEMORY=4g
Exception:
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on 
cloud3:38109 (size: 162.7 MB)15/04/07 18:11:03 INFO BlockManagerInfo: Added 
taskresult_28 on disk on cloud3:38109 (size: 162.7 MB)15/04/07 18:11:03 INFO 
TaskSetManager: Starting task 31.0 in stage 1.0 (TID 31, cloud3, NODE_LOCAL, 
1296 bytes)15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk 
on cloud2:49451 (size: 163.7 MB)15/04/07 18:11:03 INFO BlockManagerInfo: Added 
taskresult_29 on disk on cloud2:49451 (size: 163.7 MB)15/04/07 18:11:03 INFO 
TaskSetManager: Starting task 30.0 in stage 1.0 (TID 32, cloud2, NODE_LOCAL, 
1296 bytes)15/04/07 18:11:03 ERROR Utils: Uncaught exception in thread 
task-result-getter-0java.lang.OutOfMemoryError: Java heap space   at 
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)   
at 
org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)   
 at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) 
 at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)   at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
  at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
   at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
  at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
 at java.lang.Thread.run(Thread.java:745)Exception in thread 
task-result-getter-0 java.lang.OutOfMemoryError: Java heap space  at 
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)   
at 
org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)   
 at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) 
 at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)

RE: Reading a large file (binary) into RDD

2015-04-03 Thread java8964
Hadoop TextInputFormat is a good start.
It is not really that hard. You just need to implement the logic to identify the record delimiter, and think of a logical way to represent the Key and Value for your RecordReader (a skeleton sketch follows below).
Yong
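A minimal skeleton of that approach, using the new (mapreduce) Hadoop API; the class names are illustrative and, as a simplification, it treats each whole file as a single record instead of implementing a real record delimiter:

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Illustrative custom input format: each (non-split) file becomes one record
class GraphInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  override protected def isSplitable(context: JobContext, filename: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, BytesWritable] =
    new GraphRecordReader
}

class GraphRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private var in: FSDataInputStream = _
  private var processed = false
  private val key = new LongWritable(0L)
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val fileSplit = split.asInstanceOf[FileSplit]
    val fs = fileSplit.getPath.getFileSystem(context.getConfiguration)
    in = fs.open(fileSplit.getPath)
    // Simplification: read the whole (non-split) file as one record
    val bytes = new Array[Byte](fileSplit.getLength.toInt)
    in.readFully(0L, bytes)
    value.set(bytes, 0, bytes.length)
  }

  override def nextKeyValue(): Boolean = {
    val hasNext = !processed
    processed = true
    hasNext
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = if (in != null) in.close()
}

// Usage from Spark:
// val raw = sc.newAPIHadoopFile(path, classOf[GraphInputFormat],
//   classOf[LongWritable], classOf[BytesWritable])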

From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading a large file (binary) into RDD
To: deanwamp...@gmail.com
CC: java8...@hotmail.com; user@spark.apache.org

Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have no 
idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:
This might be overkill for your needs, but the scodec parser combinator library 
might be useful for creating a parser.
https://github.com/scodec/scodec
Dean Wampler, Ph.D.Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwamplerhttp://polyglotprogramming.com

On Thu, Apr 2, 2015 at 6:53 PM, java8964 java8...@hotmail.com wrote:



I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT          (A)
LONG         (B)
A INTs       (Degrees)
A SHORTINTs  (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.





  



  

RE: Spark SQL. Memory consumption

2015-04-02 Thread java8964
It is hard to say what the reason could be without more detailed information. If you provide some more information, maybe people here can help you better.
1) What is your worker's memory setting? It looks like your nodes have 128G physical memory each, but what do you specify for the worker's heap size? If you can paste your spark-env.sh and spark-defaults.conf content here, it will be helpful.
2) You are doing a join with 2 tables. 8G of parquet files is small compared to the heap you gave, but is that for one table or both? Is the data compressed?
3) Your join key is different from your grouping keys, so my assumption is that this query should lead to 4 stages (I could be wrong, as I am kind of new to Spark SQL too). Is that right? If so, at which stage did the OOM happen? With this information, it is easier to judge which part caused the OOM.
4) When you set spark.shuffle.partitions to 1024, did stages 3 and 4 really create 1024 tasks?
5) When the OOM happens, please paste the stack trace of the OOM; it will help people here guess which part of Spark leads to the OOM and give you better suggestions.
Thanks
Yong

Date: Thu, 2 Apr 2015 17:46:48 +0200
Subject: Spark SQL. Memory consumption
From: masfwo...@gmail.com
To: user@spark.apache.org

Hi. 
I'm using Spark SQL 1.2. I have this query:
CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
  field1, field2, field3, field4, field5,
  COUNT(1) AS field6,
  MAX(field7), MIN(field8),
  SUM(field9 / 100),
  COUNT(field10),
  SUM(IF(field11  -500, 1, 0)),
  MAX(field12),
  SUM(IF(field13 = 1, 1, 0)),
  SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0)),
  SUM(IF(field13 = 2012, 1, 0)),
  SUM(IF(field13 in (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE
  AND field3 IS NOT NULL
  AND field4 IS NOT NULL
  AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

spark-submit --master spark://master:7077 --driver-memory 20g --executor-memory 60g \
  --class GMain project_2.10-1.0.jar \
  --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
  --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' \
  2> ./error


The input data is 8GB in parquet format. It crashes many times with GC overhead errors. I've fixed spark.shuffle.partitions at 1024, but my worker nodes (with 128GB RAM per node) collapse.
Is the query too difficult for Spark SQL? Would it be better to do it in plain Spark? Am I doing something wrong?

Thanks-- 


Regards.
Miguel Ángel
  

Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
I wanted to check out Spark SQL 1.3.0. I installed it and followed the online document here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20
But what I got on my Spark 1.3.0 is the following error:
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String, cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^
Is this a bug in Spark 1.3.0, or is my build having some problem?
Thanks

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
The import command had already been run.
Forgot to mention: the rest of the examples related to df all work; just this one caused a problem.
Thanks
Yong

Date: Fri, 3 Apr 2015 10:36:45 +0800
From: fightf...@163.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: Re: Cannot run the example in the Spark 1.3.0 following the document


Hi, there 
you may need to add :   import sqlContext.implicits._
Best,Sun
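For what it's worth, a sketch of the select forms that do compile against the 1.3.0 API; the first matches the Column-only overload, the second relies on the implicits import suggested above. This is offered as a workaround sketch, not as a statement about what the documentation intended:

df.select(df("name"), df("age") + 1).show()

// or, after `import sqlContext.implicits._`:
df.select($"name", $"age" + 1).show()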


fightf...@163.com
 From: java8964Date: 2015-04-03 10:15To: user@spark.apache.orgSubject: Cannot 
run the example in the Spark 1.3.0 following the document
I wanted to check out Spark SQL 1.3.0. I installed it and followed the online document here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20
But what I got on my Spark 1.3.0 is the following error:
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
  /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String, cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^
Is this a bug in Spark 1.3.0, or is my build having some problem?
Thanks
  

RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() 
is the best option for your case.
Yong

From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org

The file has a specific structure. I outline it below.
The input file is basically a representation of a graph.

INT
INT          (A)
LONG         (B)
A INTs       (Degrees)
A SHORTINTs  (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)
After reading in the file, I need to create two RDDs (one with vertices and the 
other with edges)
On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hm, that will indeed be trickier because this method assumes records are the 
same byte size. Is the file an arbitrary sequence of mixed types, or is there 
structure, e.g. short, long, short, long, etc.? 
If you could post a gist with an example of the kind of file and how it should 
look once read in that would be useful!


-
jeremyfreeman.net
@thefreemanlab



On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of 
short and long integers. Is there any other way that could of use here?
My current method happens to have a large overhead (much more than actual 
computation time). Also, I am short of memory at the driver when it has to read 
the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in python to show how it works:
# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]
Does that help?
-
jeremyfreeman.net
@thefreemanlab


On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the file 
and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and 
constructing the RDD.





  

RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread java8964
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I 
cannot reproduce it on Spark 1.2.1
If we check the code change below:
Spark 1.3 
branchhttps://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
vs 
Spark 1.2 
branchhttps://github.com/apache/spark/blob/branch-1.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
You can see that on line 24:
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
is introduced on 1.3 branch.
The error basically mean runtime com.google.common.cache package cannot be 
found in the classpath.
Either you and I made the same mistake when we built Spark 1.3.0, or there is something wrong with the Spark 1.3 pom.xml file.
Here is how I built the 1.3.0:
1) Download the Spark 1.3.0 source
2) make-distribution --targz -Dhadoop.version=1.1.1 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests
Is this only due to that I built against Hadoop 1.x?
Yong

Date: Thu, 2 Apr 2015 13:56:33 -0400
Subject: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class 
refers to term cache in package com.google.common which is not available
From: tsind...@gmail.com
To: user@spark.apache.org

I was trying a simple test from the spark-shell to see if 1.3.0 would address a 
problem I was having with locating the json_tuple class and got the following 
error:
scala import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala val sqlContext = new HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = 
org.apache.spark.sql.hive.HiveContext@79c849c7

scala import sqlContext._
import sqlContext._

scala case class MetricTable(path: String, pathElements: String, name: String, 
value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in 
HiveMetastoreCatalog.class refers to term cache
in package com.google.common which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling 
HiveMetastoreCatalog.class.
That entry seems to have slain the compiler.  Shall I replay
your session? I can re-run each line except the last one.
[y/n]
Abandoning crashed session.

I entered the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 \
  --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

hive-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>

  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>***</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>

</configuration>

I have downloaded a clean version of 1.3.0 and tried it again, but got the same error. Is this a known issue? Or a configuration issue on my part? TIA for the assistance.

-Todd

RE: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread java8964
You can use the HiveContext instead of the SQLContext; it should support all of HiveQL, including lateral view explode.
SQLContext does not support that yet.
BTW, nice coding format in the email.
Yong
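A minimal sketch of that suggestion against the query shown later in this thread; loading the data from a JSON file here is only illustrative, since the original job reads from Elasticsearch:

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.jsonFile("metrics.json").registerTempTable("metric")

hiveCtx.sql(
  "SELECT path, `timestamp`, name, value, pe.value " +
  "FROM metric LATERAL VIEW explode(pathElements) a AS pe"
).collect().foreach(println)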

Date: Tue, 31 Mar 2015 18:18:19 -0400
Subject: Re: SparkSql - java.util.NoSuchElementException: key not found: node 
when access JSON Array
From: tsind...@gmail.com
To: user@spark.apache.org

So in looking at this a bit more, I gather the root cause is the fact that the 
nested fields are represented as rows within rows, is that correct?  If I don't 
know the size of the json array (it varies), using x.getAs[Row](0).getString(0) 
is not really a valid solution.  
Is the solution to apply a lateral view + explode to this? 
I have attempted to change to a lateral view, but looks like my syntax is off:








sqlContext.sql(
SELECT path,`timestamp`, name, value, pe.value FROM metric 
 lateral view explode(pathElements) a AS pe)
.collect.foreach(println(_))
Which results in:
15/03/31 17:38:34 INFO ContextCleaner: Cleaned broadcast 0
Exception in thread main java.lang.RuntimeException: [1.68] failure: 
``UNION'' expected but identifier view found

SELECT path,`timestamp`, name, value, pe.value FROM metric lateral view 
explode(pathElements) a AS pe
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:303)
at 
com.opsdatastore.elasticsearch.spark.ElasticSearchReadWrite$.main(ElasticSearchReadWrite.scala:97)
at 
com.opsdatastore.elasticsearch.spark.ElasticSearchReadWrite.main(ElasticSearchReadWrite.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Is this the 
right approach?  Is this syntax available in 1.2.1:
SELECT
  v1.name, v2.city, v2.state 
FROM people
  LATERAL VIEW json_tuple(people.jsonObject, 'name', 'address') v1 
 as name, address
  LATERAL VIEW json_tuple(v1.address, 'city', 'state') v2
 as city, state;
-Todd
On Tue, Mar 31, 2015 at 3:26 PM, Todd Nist tsind...@gmail.com wrote:
I am accessing ElasticSearch via the 

RE: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread java8964
I think the jar file has to be local. A jar in HDFS is not supported yet in Spark.
See this answer:
http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs

 Date: Sun, 29 Mar 2015 22:34:46 -0700
 From: n.e.trav...@gmail.com
 To: user@spark.apache.org
 Subject: java.io.FileNotFoundException when using HDFS in cluster mode
 
 Hi List,
 
 I'm following this example  here
 https://github.com/databricks/learning-spark/tree/master/mini-complete-example
   
 with the following:
 
 $SPARK_HOME/bin/spark-submit \
   --deploy-mode cluster \
   --master spark://host.domain.ex:7077 \
   --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
  
 hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar
 \
   hdfs://host.domain.ex/user/nickt/linkage
 hdfs://host.domain.ex/user/nickt/wordcounts
 
 The jar is submitted fine and I can see it appear on the driver node (i.e.
 connecting to and reading from HDFS ok):
 
 -rw-r--r-- 1 nickt nickt  15K Mar 29 22:05
 learning-spark-mini-example_2.10-0.0.1.jar
 -rw-r--r-- 1 nickt nickt 9.2K Mar 29 22:05 stderr
 -rw-r--r-- 1 nickt nickt0 Mar 29 22:05 stdout
 
 But it's failing due to a java.io.FileNotFoundException saying my input file
 is missing:
 
 Caused by: java.io.FileNotFoundException: Added file
 file:/home/nickt/spark-1.3.0/work/driver-20150329220503-0021/hdfs:/host.domain.ex/user/nickt/linkage
 does not exist.
 
 I'm using sc.addFile(hdfs://path/to/the_file.txt) to propagate to all the
 workers and sc.textFile(SparkFiles(the_file.txt)) to return the path to
 the file on each of the hosts.
 
 Has anyone come up against this before when reading from HDFS? No doubt I'm
 doing something wrong.
 
 Full trace below:
 
 Launch Command: /usr/java/java8/bin/java -cp
 :/home/nickt/spark-1.3.0/conf:/home/nickt/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.0.0-mr1-cdh4.6.0.jar
 -Dakka.loglevel=WARNING -Dspark.driver.supervise=false
 -Dspark.app.name=com.oreilly.learningsparkexamples.mini.scala.WordCount
 -Dspark.akka.askTimeout=10
 -Dspark.jars=hdfs://host.domain.ex/user/nickt/learning-spark-mini-example_2.10-0.0.1.jar
 -Dspark.master=spark://host.domain.ex:7077 -Xms512M -Xmx512M
 org.apache.spark.deploy.worker.DriverWrapper
 akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
 /home/nickt/spark-1.3.0/work/driver-20150329220503-0021/learning-spark-mini-example_2.10-0.0.1.jar
 com.oreilly.learningsparkexamples.mini.scala.WordCount
 hdfs://host.domain.ex/user/nickt/linkage
 hdfs://host.domain.ex/user/nickt/wordcounts
 
 
 log4j:WARN No appenders could be found for logger
 (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
 more info.
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/03/29 22:05:05 INFO SecurityManager: Changing view acls to: nickt
 15/03/29 22:05:05 INFO SecurityManager: Changing modify acls to: nickt
 15/03/29 22:05:05 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(nickt); users
 with modify permissions: Set(nickt)
 15/03/29 22:05:05 INFO Slf4jLogger: Slf4jLogger started
 15/03/29 22:05:05 INFO Utils: Successfully started service 'Driver' on port
 44201.
 15/03/29 22:05:05 INFO WorkerWatcher: Connecting to worker
 akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
 15/03/29 22:05:05 INFO SparkContext: Running Spark version 1.3.0
 15/03/29 22:05:05 INFO SecurityManager: Changing view acls to: nickt
 15/03/29 22:05:05 INFO SecurityManager: Changing modify acls to: nickt
 15/03/29 22:05:05 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(nickt); users
 with modify permissions: Set(nickt)
 15/03/29 22:05:05 INFO Slf4jLogger: Slf4jLogger started
 15/03/29 22:05:05 INFO Utils: Successfully started service 'sparkDriver' on
 port 33382.
 15/03/29 22:05:05 INFO SparkEnv: Registering MapOutputTracker
 15/03/29 22:05:05 INFO SparkEnv: Registering BlockManagerMaster
 15/03/29 22:05:05 INFO DiskBlockManager: Created local directory at
 /tmp/spark-9c52eb1e-92b9-4e3f-b0e9-699a158f8e40/blockmgr-222a2522-a0fc-4535-a939-4c14d92dc666
 15/03/29 22:05:05 INFO WorkerWatcher: Successfully connected to
 akka.tcp://sparkwor...@host5.domain.ex:40830/user/Worker
 15/03/29 22:05:05 INFO MemoryStore: MemoryStore started with capacity 265.1
 MB
 15/03/29 22:05:05 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-031afddd-2a75-4232-931a-89e502b0d722/httpd-7e22bb57-3cfe-4c89-aaec-4e6ca1a65f66
 15/03/29 22:05:05 INFO HttpServer: Starting HTTP Server
 15/03/29 22:05:05 INFO Server: jetty-8.y.z-SNAPSHOT
 15/03/29 22:05:05 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:42484
 15/03/29 22:05:05 INFO Utils: Successfully started service 

RE: 2 input paths generate 3 partitions

2015-03-27 Thread java8964
The files sound too small to be 2 blocks in HDFS.
Did you set the defaultParallelism to be 3 in your spark?
Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +






Hi Rares,



The number of partitions is controlled by the HDFS input format, and one file may 
map to multiple partitions if it consists of multiple blocks. In your case, I think 
there is one file with 2 splits.
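A quick way to see this from the shell (the directory name is a placeholder; the
second argument is only a hint handed to Hadoop's FileInputFormat when computing
splits, which is likely why a tiny part file can still be cut in two):

  val foo = sc.textFile("hdfs:///path/to/output_dir", 2)

  // Show which input line landed in which partition.
  foo.mapPartitionsWithIndex((idx, it) => it.map(line => (idx, line)))
     .collect()
     .foreach(println)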



Thanks.



Zhan Zhang


On Mar 27, 2015, at 3:12 PM, Rares Vernica rvern...@gmail.com wrote:


Hello,



I am using the Spark shell in Scala on the localhost. I am using 
sc.textFile to read a directory. The directory looks like this (generated by 
another Spark script):




part-0
part-1
_SUCCESS




The part-0 has four short lines of text while
part-1 has two short lines of text. The
_SUCCESS file is empty. When I check the number of partitions on the RDD I get:




scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3




I wonder why the two input files generate three partitions. Does Spark check 
the number of lines in each file and try to generate three balanced partitions?



Thanks!
Rares






  

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-20 Thread java8964
Hi, Imran:
Thanks for your information.
I found a benchmark online about serialization which compares Java vs Kryo vs 
gridgain at here: 
http://gridgain.blogspot.com/2012/12/java-serialization-good-fast-and-faster.html
From my test result, in the above benchmark case for the SimpleObject, Kryo is 
slightly faster than Java serialization, but uses only about half of the space of 
Java serialization.
So now I understand more about what kind of benefits I should expect from using 
KryoSerializer.
But I have some questions related to Spark SQL. If I use Spark SQL, should I 
expect less memory usage? I mean that in Spark SQL, everything is controlled by 
Spark. If I pass in 
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer and cache the 
table, will it use much less memory? Do I also need to specify 
StorageLevel.MEMORY_ONLY_SER if I want to use less memory? Where can I set 
that in Spark SQL?
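For reference, a hedged sketch of the two knobs in question (the table name is a
placeholder; note that sqlContext.cacheTable stores data in Spark SQL's own
in-memory columnar format, so spark.serializer may not affect its footprint the
way it does for plain RDD caching):

  import org.apache.spark.storage.StorageLevel

  // Cache through Spark SQL's columnar store (compressed by default).
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
  sqlContext.cacheTable("my_table")

  // Or cache a query result as a plain RDD with an explicit serialized storage level.
  val result = sqlContext.sql("SELECT * FROM my_table")
  result.persist(StorageLevel.MEMORY_ONLY_SER)
  result.count()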
Thanks
Yong

From: iras...@cloudera.com
Date: Fri, 20 Mar 2015 11:54:38 -0500
Subject: Re: Why I didn't see the benefits of using KryoSerializer
To: java8...@hotmail.com
CC: user@spark.apache.org

Hi Yong,
yes I think your analysis is correct.  I'd imagine almost all serializers out 
there will just convert a string to its utf-8 representation.  You might be 
interested in adding compression on top of a serializer, which would probably 
bring the string size down in almost all cases, but then you also need to take 
the time for compression.  Kryo is generally more efficient than the java 
serializer on complicated object types.
I guess I'm still a little surprised that kryo is slower than java 
serialization for you.  You might try setting spark.kryo.referenceTracking to 
false if you are just serializing objects with no circular references.  I think 
that will improve the performance a little, though I dunno how much.
It might be worth running your experiments again with slightly more complicated 
objects and see what you observe.
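For reference, a minimal sketch of those settings (the record class is a
placeholder; registering the classes that actually get serialized is usually
where Kryo's size advantage shows up):

  import org.apache.spark.SparkConf

  case class SomeRecord(id: Long, name: String)

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.referenceTracking", "false")  // only safe without circular references
    .registerKryoClasses(Array(classOf[SomeRecord]))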
Imran

On Thu, Mar 19, 2015 at 12:57 PM, java8964 java8...@hotmail.com wrote:



I read the Spark code a little bit, trying to understand my own question.
It looks like the different is really between 
org.apache.spark.serializer.JavaSerializer and 
org.apache.spark.serializer.KryoSerializer, both having the method named 
writeObject.
In my test case, for each line of my text file, it is about 140 bytes of 
String. When either JavaSerializer.writeObject(140 bytes of String) or 
KryoSerializer.writeObject(140 bytes of String), I didn't see difference in the 
underline OutputStream space usage.
Does this mean that KryoSerializer really doesn't give us any benefit for 
String type? I understand that for primitives types, it shouldn't have any 
benefits, but how about String type?
When we talk about lower the memory using KryoSerializer in spark, under what 
case it can bring significant benefits? It is my first experience with the 
KryoSerializer, so maybe I am total wrong about its usage.
Thanks
Yong 
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Why I didn't see the benefits of using KryoSerializer
Date: Tue, 17 Mar 2015 12:01:35 -0400




Hi, I am new to Spark. I tried to understand the memory benefits of using 
KryoSerializer.
I have this one box standalone test environment, which is 24 cores with 24G 
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file in the hdfs about 1.2G.  Here is the settings in the 
spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=4g
First test case:
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
log.count()
log.count()
The data is about 3M rows. For the first test case, from the storage in the web 
UI, I can see Size in Memory is 1787M, and Fraction Cached is 70% with 7 
cached partitions.This matched with what I thought, and first count finished 
about 17s, and 2nd count finished about 6s.
2nd test case, after restarting the spark-shell:
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
Now from the web UI, I can see Size in Memory is 1231M, and Fraction Cached 
is 100% with 10 cached partitions. It looks like caching the default java 
serialized format reduce the memory usage, but coming with a cost that first 
count finished around 39s and 2nd count finished around 9s. So the job runs 
slower, with less memory usage.
So far I can understand all what happened and the tradeoff.
Now the problem comes with when I tried to test with KryoSerializer
SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" /opt/spark/bin/spark-shell
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
First, I saw that the new serializer setting passed in, as proven in the Spark

RE: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread java8964
Did you check the ulimit for the user running Spark on your nodes?
Can you run ulimit -a as the user who is running Spark on the executor 
node? Does the result make sense for the data you are trying to process?
Yong
From: szheng.c...@gmail.com
To: user@spark.apache.org
Subject: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too 
large vs FileNotFoundException (Too many open files) on spark 1.2.1
Date: Fri, 20 Mar 2015 15:28:26 -0400

Hi All, I try to run a simple sort by on 1.2.1. And it always give me below two 
errors: 1, 15/03/20 17:48:29 WARN TaskSetManager: Lost task 2.0 in stage 1.0 
(TID 35, ip-10-169-217-47.ec2.internal): java.io.FileNotFoundException: 
/tmp/spark-e40bb112-3a08-4f62-9eaa-cd094fcfa624/spark-58f72d53-8afc-41c2-ad6b-e96b479b51f5/spark-fde6da79-0b51-4087-8234-2c07ac6d7586/spark-dd7d6682-19dd-4c66-8aa5-d8a4abe88ca2/16/temp_shuffle_756b59df-ef3a-4680-b3ac-437b53267826
 (Too many open files). And then I switch to:
conf.set("spark.shuffle.consolidateFiles", "true").set("spark.shuffle.manager", "SORT")
Then I get the error: Exception in 
thread main org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 
in stage 1.0 (TID 36, ip-10-169-217-47.ec2.internal): 
com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large
at com.esotericsoftware.kryo.io.Output.flush(Output.java:157) I roughly 
know the first issue is because Spark shuffle creates too many local temp files 
(and I don’t know the solution, because looks like my solution also cause other 
issues), but I am not sure what means is the second error.  Anyone knows the 
solution for both cases? Regards, Shuai 
  

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-19 Thread java8964
I read the Spark code a little bit, trying to understand my own question.
It looks like the difference is really between 
org.apache.spark.serializer.JavaSerializer and 
org.apache.spark.serializer.KryoSerializer, both having the method named 
writeObject.
In my test case, for each line of my text file, it is about 140 bytes of 
String. When either JavaSerializer.writeObject(140 bytes of String) or 
KryoSerializer.writeObject(140 bytes of String), I didn't see a difference in the 
underlying OutputStream space usage.
Does this mean that KryoSerializer really doesn't give us any benefit for 
String type? I understand that for primitives types, it shouldn't have any 
benefits, but how about String type?
When we talk about lower the memory using KryoSerializer in spark, under what 
case it can bring significant benefits? It is my first experience with the 
KryoSerializer, so maybe I am totally wrong about its usage.
Thanks
Yong 
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Why I didn't see the benefits of using KryoSerializer
Date: Tue, 17 Mar 2015 12:01:35 -0400




Hi, I am new to Spark. I tried to understand the memory benefits of using 
KryoSerializer.
I have this one box standalone test environment, which is 24 cores with 24G 
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file in the hdfs about 1.2G.  Here is the settings in the 
spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=4g
First test case:
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
log.count()
log.count()
The data is about 3M rows. For the first test case, from the storage in the web 
UI, I can see Size in Memory is 1787M, and Fraction Cached is 70% with 7 
cached partitions.This matched with what I thought, and first count finished 
about 17s, and 2nd count finished about 6s.
2nd test case, after restarting the spark-shell:
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
Now from the web UI, I can see Size in Memory is 1231M, and Fraction Cached 
is 100% with 10 cached partitions. It looks like caching the default java 
serialized format reduce the memory usage, but coming with a cost that first 
count finished around 39s and 2nd count finished around 9s. So the job runs 
slower, with less memory usage.
So far I can understand all what happened and the tradeoff.
Now the problem comes with when I tried to test with KryoSerializer
SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" /opt/spark/bin/spark-shell
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
First, I saw that the new serializer setting was passed in, as shown in the Spark 
Properties on the Environment tab:

spark.driver.extraJavaOptions    -Dspark.serializer=org.apache.spark.serializer.KryoSerializer

This is not there for the first 2 test cases. But in the web UI of Storage, the 
"Size in Memory" is 1234M, with 100% "Fraction Cached" and 10 cached 
partitions. The first count took 46s and the 2nd count took 23s.
I don't get the much smaller memory size I expected, but a longer run time for both 
counts. Did I do anything wrong? Why does the memory footprint of MEMORY_ONLY_SER 
with KryoSerializer still use the same size as the default Java serializer, with a 
worse duration?
Thanks
Yong
  

RE: mapPartitions - How Does it Works

2015-03-18 Thread java8964
Here is what I think:
mapPartitions is for a specialized map that is called only once for each 
partition. The entire content of the respective partitions is available as a 
sequential stream of values via the input argument (Iterator[T]). The 
combined result iterators are automatically converted into a new RDD.
So in this case, the RDD (1, 2, ..., 10) is split into 3 partitions: (1,2,3), 
(4,5,6), (7,8,9,10).
For every partition, your function gets the first element via x.next, 
uses it to build a List, and returns the iterator over that List.
So each partition returns (1), (4) and (7) as 3 iterators, which are then 
combined into one final RDD (1, 4, 7).
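For comparison, a small sketch where the whole iterator is consumed instead of
just x.next (same 3 partitions as above):

  val parallel = sc.parallelize(1 to 10, 3)

  // One sum per partition: Array(6, 15, 34)
  parallel.mapPartitions(iter => Iterator(iter.sum)).collect()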
Yong

 Date: Wed, 18 Mar 2015 10:19:34 -0700
 From: ashish.us...@gmail.com
 To: user@spark.apache.org
 Subject: mapPartitions - How Does it Works
 
 I am trying to understand about mapPartitions but i am still not sure how it
 works
 
 in the below example it create three partition 
 val parallel = sc.parallelize(1 to 10, 3)
 
 and when we do below 
 parallel.mapPartitions(x => List(x.next).iterator).collect
 
 it prints value 
 Array[Int] = Array(1, 4, 7)
 
 Can some one please explain why it prints 1,4,7 only
 
 Thanks,
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/mapPartitions-How-Does-it-Works-tp22123.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
  

Why I didn't see the benefits of using KryoSerializer

2015-03-17 Thread java8964
Hi, I am new to Spark. I tried to understand the memory benefits of using 
KryoSerializer.
I have this one box standalone test environment, which is 24 cores with 24G 
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file in the hdfs about 1.2G.  Here is the settings in the 
spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=4g
First test case:
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
log.count()
log.count()
The data is about 3M rows. For the first test case, from the storage in the web 
UI, I can see Size in Memory is 1787M, and Fraction Cached is 70% with 7 
cached partitions.This matched with what I thought, and first count finished 
about 17s, and 2nd count finished about 6s.
2nd test case, after restarting the spark-shell:
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
Now from the web UI, I can see Size in Memory is 1231M, and Fraction Cached 
is 100% with 10 cached partitions. It looks like caching the default Java 
serialized format reduces the memory usage, but comes with a cost: the first 
count finished in around 39s and the 2nd count in around 9s. So the job runs 
slower, with less memory usage.
So far I can understand all what happened and the tradeoff.
Now the problem comes with when I tried to test with KryoSerializer
SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" /opt/spark/bin/spark-shell
val log = sc.textFile("hdfs://namenode:9000/test_1g/")
log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
log.count()
log.count()
First, I saw that the new serializer setting was passed in, as shown in the Spark 
Properties on the Environment tab:

spark.driver.extraJavaOptions    -Dspark.serializer=org.apache.spark.serializer.KryoSerializer

This is not there for the first 2 test cases. But in the web UI of Storage, the 
"Size in Memory" is 1234M, with 100% "Fraction Cached" and 10 cached 
partitions. The first count took 46s and the 2nd count took 23s.
I don't get the much smaller memory size I expected, but a longer run time for both 
counts. Did I do anything wrong? Why does the memory footprint of MEMORY_ONLY_SER 
with KryoSerializer still use the same size as the default Java serializer, with a 
worse duration?
Thanks
Yong  

RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner?
At least for join, you can implement your own partitioner, to utilize the 
sorted data.
Just my 2 cents.
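A rough sketch of the idea (the data and partition count are placeholders): give
both pair RDDs the same partitioner up front, so the join itself can reuse that
partitioning instead of reshuffling both sides.

  import org.apache.spark.RangePartitioner

  val left  = sc.parallelize(Seq(1 -> "a", 5 -> "b", 9 -> "c"))
  val right = sc.parallelize(Seq(1 -> "x", 9 -> "y"))

  val part = new RangePartitioner(4, left)
  val leftPart  = left.partitionBy(part).persist()
  val rightPart = right.partitionBy(part).persist()

  // Two RDDs sharing a partitioner join without an extra shuffle of either side.
  val joined = leftPart.join(rightPart)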
Date: Wed, 11 Mar 2015 17:38:04 -0400
Subject: can spark take advantage of ordered data?
From: jcove...@gmail.com
To: User@spark.apache.org

Hello all,
I am wondering if spark already has support for optimizations on sorted data 
and/or if such support could be added (I am comfortable dropping to a lower 
level if necessary to implement this, but I'm not sure if it is possible at 
all).
Context: we have a number of data sets which are essentially already sorted on 
a key. With our current systems, we can take advantage of this to do a lot of 
analysis in a very efficient fashion...merges and joins, for example, can be 
done very efficiently, as can folds on a secondary key and so on.
I was wondering if spark would be a fit for implementing these sorts of 
optimizations? Obviously it is sort of a niche case, but would this be 
achievable? Any pointers on where I should look?
   

RE: Spark SQL using Hive metastore

2015-03-11 Thread java8964
You need to include the Hadoop native library in your spark-shell/spark-sql, 
assuming your Hadoop native library includes the native Snappy library:
spark-sql --driver-library-path point_to_your_hadoop_native_library
In spark-sql, you can just use any command as you are in Hive CLI.
Yong

Date: Wed, 11 Mar 2015 21:06:54 +
From: rgra...@yahoo.com.INVALID
To: user@spark.apache.org
Subject: Spark SQL using Hive metastore

Hi guys,
I am a newbie in running Spark SQL / Spark. My goal is to run some TPC-H 
queries atop Spark SQL using Hive metastore. 
It looks like spark 1.2.1 release has Spark SQL / Hive support. However, I am 
not able to fully connect all the dots. 

I did the following:
1. Copied hive-site.xml from hive to spark/conf
2. Copied the mysql connector to spark/lib
3. Started the hive metastore service: hive --service metastore
4. Started ./bin/spark-sql
5. Typed: spark-sql> show tables;
However, the following error was thrown:
Job 0 failed: collect at SparkPlan.scala:84, took 0.241788 s
15/03/11 15:02:35 ERROR SparkSQLDriver: Failed in [show tables]
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: 
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native 
library is found for os.name=Linux and os.arch=aarch64
Do  you know what I am doing wrong ? I mention that I have hive-0.14 instead of 
hive-0.13. 

And another question: What is the right command to run sql queries with spark 
sql using hive metastore ?
Thanks,Robert
  

RE: sc.textFile() on windows cannot access UNC path

2015-03-10 Thread java8964
I think the workaround is clear.
Use JDK 7, and implement your own saveAsRemoteWinText() using java.nio.file.Path.
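For reference, a hedged sketch of that java.nio workaround on JDK 7 (using the UNC
path from the question; only practical when the file is small enough to read on
the driver):

  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths}
  import scala.collection.JavaConverters._

  // Read the UNC file on the driver with java.nio, then parallelize the lines.
  val uncPath = Paths.get("""\\10.196.119.230\folder1\abc.txt""")
  val lines = Files.readAllLines(uncPath, StandardCharsets.UTF_8).asScala
  val rdd = sc.parallelize(lines)
  rdd.count()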
Yong

From: ningjun.w...@lexisnexis.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: RE: sc.textFile() on windows cannot access UNC path
Date: Tue, 10 Mar 2015 03:02:37 +









Hi Yong
 
Thanks for the reply. Yes, it works with a local drive letter. But I really need 
to use a UNC path because the path is input at runtime. I cannot dynamically 
assign a drive letter to an arbitrary UNC path at runtime.
 
Is there any work around that I can use UNC path for sc.textFile(…)?

 
 

Ningjun
 

 


From: java8964 [mailto:java8...@hotmail.com]


Sent: Monday, March 09, 2015 5:33 PM

To: Wang, Ningjun (LNG-NPV); user@spark.apache.org

Subject: RE: sc.textFile() on windows cannot access UNC path


 

This is a Java problem, not really Spark.

 


From this page: 
http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u


 


You can see that using java.nio.* on JDK 7 will fix this issue. But the Path 
class in Hadoop uses java.io.*, instead of java.nio.


 


You need to manually mount your Windows remote share as a local drive, like Z:, 
then it should work.


 


Yong




From:
ningjun.w...@lexisnexis.com

To: user@spark.apache.org

Subject: sc.textFile() on windows cannot access UNC path

Date: Mon, 9 Mar 2015 21:09:38 +

I am running Spark on windows 2008 R2. I use sc.textFile() to load text file  
using UNC path, it does not work.
 
sc.textFile(rawfile:10.196.119.230/folder1/abc.txt,
4).count()

 
Input path does not exist: file:/10.196.119.230/folder1/abc.txt
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/10.196.119.230/tar/Enron/enron-207-short.load
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.count(RDD.scala:910)
at 
ltn.analytics.tests.IndexTest$$anonfun$3.apply$mcV$sp(IndexTest.scala:104)
at 
ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103)
at 
ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests

RE: Compilation error

2015-03-10 Thread java8964
Or another option is to use Scala-IDE, which is built on top of Eclipse, 
instead of pure Eclipse, so Scala comes with it.
Yong

 From: so...@cloudera.com
 Date: Tue, 10 Mar 2015 18:40:44 +
 Subject: Re: Compilation error
 To: mohitanch...@gmail.com
 CC: t...@databricks.com; user@spark.apache.org
 
 A couple points:
 
 You've got mismatched versions here -- 1.2.0 vs 1.2.1. You should fix
 that but it's not your problem.
 
 These are also supposed to be 'provided' scope dependencies in Maven.
 
 You should get the Scala deps transitively and can import scala.*
 classes. However, it would be a little bit more correct to depend
 directly on the scala library classes, but in practice, easiest not to
 in simple use cases.
 
 If you're still having trouble look at the output of mvn dependency:tree
 
 On Tue, Mar 10, 2015 at 6:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
  I am using maven and my dependency looks like this, but this doesn't seem to
  be working
 
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.1</version>
    </dependency>
  </dependencies>
 
 
  On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote:
 
  If you are using tools like SBT/Maven/Gradle/etc, they figure out all the
  recursive dependencies and includes them in the class path. I haven't
  touched Eclipse in years so I am not sure off the top of my head what's
  going on instead. Just in case you only downloaded the
  spark-streaming_2.10.jar  then that is indeed insufficient and you have to
  download all the recursive dependencies. May be you should create a Maven
  project inside Eclipse?
 
  TD
 
  On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
  How do I do that? I haven't used Scala before.
 
  Also, linking page doesn't mention that:
 
 
  http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking
 
  On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote:
 
  It means you do not have Scala library classes in your project
  classpath.
 
  On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
   I am trying out streaming example as documented and I am using spark
   1.2.1
   streaming from maven for Java.
  
   When I add this code I get compilation error on and eclipse is not
   able to
   recognize Tuple2. I also don't see any import scala.Tuple2 class.
  
  
  
   http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example
  
  
    private void map(JavaReceiverInputDStream<String> lines) {
      JavaDStream<String> words = lines.flatMap(
        new FlatMapFunction<String, String>() {
          @Override public Iterable<String> call(String x) {
            return Arrays.asList(x.split(" "));
          }
        });
      // Count each word in each batch
      JavaPairDStream<String, Integer> pairs = words.map(
        new PairFunction<String, String, Integer>() {
          @Override public Tuple2<String, Integer> call(String s) throws Exception {
            return new Tuple2<String, Integer>(s, 1);
          }
        });
    }
 
 
 
 
 
 
  

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread java8964
This is a Java problem, not really Spark.
From this page: 
http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u
You can see that using java.nio.* on JDK 7 will fix this issue. But the Path 
class in Hadoop uses java.io.*, instead of java.nio.
You need to manually mount your Windows remote share as a local drive, like Z:, 
then it should work.
Yong

From: ningjun.w...@lexisnexis.com
To: user@spark.apache.org
Subject: sc.textFile() on windows cannot access UNC path
Date: Mon, 9 Mar 2015 21:09:38 +









I am running Spark on windows 2008 R2. I use sc.textFile() to load text file  
using UNC path, it does not work.
 
sc.textFile(rawfile:10.196.119.230/folder1/abc.txt,
4).count()

 
Input path does not exist: file:/10.196.119.230/folder1/abc.txt
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/10.196.119.230/tar/Enron/enron-207-short.load
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.count(RDD.scala:910)
at 
ltn.analytics.tests.IndexTest$$anonfun$3.apply$mcV$sp(IndexTest.scala:104)
at 
ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103)
at 
ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at 
ltn.analytics.tests.IndexTest.org$scalatest$BeforeAndAfterAll$$super$run(IndexTest.scala:15)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
at 

From Spark web ui, how to prove the parquet column pruning working

2015-03-09 Thread java8964
Hi, Currently most of the data in our production is stored as Avro + Snappy. I want 
to test the benefits if we store the data in Parquet format. I changed our 
ETL to generate the Parquet format, instead of Avro, and want to test a simple 
SQL query in Spark SQL, to verify the benefits from Parquet.
I generated the same dataset in both Avro and Parquet in HDFS, and loaded them 
both in Spark SQL. Now when I run the same query, like "select column1 from 
src_table_avro/parquet where column2=xxx", I can see that for the Parquet data 
format, the job runs much faster. The test file sizes for both formats are around 
930M. The Avro job generated 8 tasks to read the data with 21s as the median 
duration, vs the Parquet job generating 7 tasks to read the data with 0.4s as the 
median duration.
Since the dataset has more than 100 columns, I can see that the Parquet file really 
does read fast. But my question is: in the Spark UI, both jobs show 
900M as the input size, and 0 for the rest. In this case, how do I know that column 
pruning really works? I think that is why the Parquet file can be read so 
fast, but is there any statistic on the Spark UI that can prove it to me? 
Something like: the total input file size is 900M, but only 10M was really read due 
to column pruning? That way, if column pruning does not work for Parquet with some 
kind of SQL query, I can identify it in the first place.
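One rough way to sanity-check it (paths and column names below are placeholders):
run the same filter once projecting a single column and once with all columns, and
compare how the two jobs behave; in this Spark version the Input column in the UI
may still report the full file size, which is part of what is observed above.

  val parquetTable = sqlContext.parquetFile("hdfs:///warehouse/src_table_parquet")
  parquetTable.registerTempTable("src_table_parquet")

  sqlContext.sql("SELECT column1 FROM src_table_parquet WHERE column2 = 'xxx'").count()
  sqlContext.sql("SELECT * FROM src_table_parquet WHERE column2 = 'xxx'").count()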
Thanks
Yong  

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Can anyone share any thoughts related to my questions?
Thanks

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Help me understand the partition, parallelism in Spark
Date: Wed, 25 Feb 2015 21:58:55 -0500




Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.
I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or spark.default.parallelism shouldn't have any impact.
For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:
We have 100 partitions, as the data comes from 100 blocks. Most likely the 
spark will generate 100 tasks to read and shuffle them?The 1000 unique keys 
mean the 1000 reducer group, like in MRIf I set the max core to be 50, so there 
will be up to 50 tasks can be run concurrently. The rest tasks just have to 
wait for the core, if there are 50 tasks are running.Since we are doing 
reduceByKey, shuffling will happen. Data will be shuffled into 1000 partitions, 
as we have 1000 unique keys.I don't know these 1000 partitions will be 
processed by how many tasks, maybe this is the parallelism parameter comes 
in?No matter what parallelism this will be, there are ONLY 50 task can be run 
concurrently. So if we set more cores, more partitions' data will be processed 
in the executor (which runs more thread in this case), so more memory needs. I 
don't see how increasing parallelism could help the OOM in this case.In my test 
case of Spark SQL, I gave 24G as the executor heap, my join between 2 big 
datasets keeps getting OOM. I keep increasing the spark.default.parallelism, 
from 200 to 400, to 2000, even to 4000, no help. What really makes the query 
finish finally without OOM is after I change the --total-executor-cores from 
10 to 4.
So my questions are:1) What is the parallelism really mean in the Spark? In the 
simple example above, for reduceByKey, what difference it is between 
parallelism change from 10 to 20?2) When we talk about partition in the spark, 
for the data coming from HDFS, I can understand the partition clearly. For the 
intermediate data, the partition will be same as key, right? For group, 
reducing, join action, uniqueness of the keys will be partition. Is that 
correct?3) Why increasing parallelism could help OOM? I don't get this part. 
From my limited experience, adjusting the core count really matters for memory.
Thanks
Yong
  

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Imran, thanks for your explanation about the parallelism. That is very helpful.
In my test case, I am only using a one-box cluster, with one executor. So if I give 
it 10 cores, then 10 concurrent tasks will run within this one executor, which 
will handle more data than in the 4-core case, and that led to the OOM.
I haven't set up Spark on our production cluster yet, but assuming we have a 
100-node cluster, if I guess right, setting up to 1000 cores means that on 
average each box's executor will run 10 threads to process data. So lowering the 
cores will reduce the speed of Spark, but can help to avoid the OOM, as less 
data is processed in memory.
Another guess of mine is that each partition will eventually be processed by one 
core. So a bigger partition count decreases the partition size, which 
should help the memory footprint. In my case, I guess that Spark SQL in fact 
doesn't use the spark.default.parallelism setting, or at least in my query 
it is not used, so no matter what I change, it doesn't matter. The reason I 
say that is that there are always 200 tasks in stages 2 and 3 of my query job, 
no matter what I set spark.default.parallelism to.
I think lowering the cores trades speed for lower memory usage. Hope my 
understanding is correct.
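One hedged guess about the constant 200 tasks: Spark SQL sizes its shuffle stages
with spark.sql.shuffle.partitions (default 200) rather than
spark.default.parallelism, so changing the latter would have no visible effect
there. For example:

  sqlContext.setConf("spark.sql.shuffle.partitions", "400")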
Thanks
Yong
Date: Thu, 26 Feb 2015 17:03:20 -0500
Subject: Re: Help me understand the partition, parallelism in Spark
From: yana.kadiy...@gmail.com
To: iras...@cloudera.com
CC: java8...@hotmail.com; user@spark.apache.org

Imran, I have also observed the phenomenon of reducing the cores helping with 
OOM. I wanted to ask this (hopefully without straying off topic): we can 
specify the number of cores and the executor memory. But we don't get to 
specify _how_ the cores are spread among executors.
Is it possible that with 24G memory and 4 cores we get a spread of 1 core per 
executor thus ending up with 24G for the task, but with 24G memory and 10 cores 
some executor ends up with 3 cores on the same machine and thus we have only 8G 
per task?
On Thu, Feb 26, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:
Hi Yong,
mostly correct except for: "Since we are doing reduceByKey, shuffling will 
happen. Data will be shuffled into 1000 partitions, as we have 1000 unique 
keys." No, you will not get 1000 partitions.  Spark has to decide how many 
partitions to use before it even knows how many unique keys there are.  If you 
have 200 as the default parallelism (or you just explicitly make it the second 
parameter to reduceByKey()), then you will get 200 partitions.  The 1000 unique 
keys will be distributed across the 200 partitions.  Ideally they will be 
distributed pretty equally, but how they get distributed depends on the 
partitioner (by default you will have a HashPartitioner, so it depends on the 
hash of your keys).
Note that this is more or less the same as in Hadoop MapReduce.
the amount of parallelism matters b/c there are various places in spark where 
there is some overhead proportional to the size of a partition.  So in your 
example, if you have 1000 unique keys in 200 partitions, you expect about 5 
unique keys per partitions -- if instead you had 10 partitions, you'd expect 
100 unique keys per partitions, and thus more data and you'd be more likely to 
hit an OOM.  But there are many other possible sources of OOM, so this is 
definitely not the *only* solution.
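A small sketch of that point (numbers are illustrative): the shuffle's partition
count is fixed up front by the numPartitions argument (or the default
parallelism), not by the number of distinct keys.

  val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1))

  val reduced = pairs.reduceByKey(_ + _, 200)  // 1000 distinct keys spread over 200 partitions
  reduced.partitions.length                    // 200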
Sorry I can't comment in particular about Spark SQL -- hopefully somebody more 
knowledgeable can comment on that.


On Wed, Feb 25, 2015 at 8:58 PM, java8964 java8...@hotmail.com wrote:



Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.
I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or spark.default.parallelism shouldn't have any impact.
For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:
We have 100 partitions, as the data comes from 100 blocks. Most likely the 
spark will generate 100 tasks to read and shuffle them?The 1000 unique keys 
mean the 1000 reducer group, like in MRIf I set the max core to be 50, so there 
will be up to 50 tasks can be run concurrently. The rest tasks just have to 
wait for the core, if there are 50 tasks are running.Since we are doing 
reduceByKey

Help me understand the partition, parallelism in Spark

2015-02-25 Thread java8964
Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal 
information of spark. From the web and this list, I keep seeing people talking 
about increase the parallelism if you get the OOM error. I tried to read 
document as much as possible to understand the RDD partition, and parallelism 
usage in the spark.
I understand that for RDD from HDFS, by default, one partition will be one HDFS 
block, pretty straightforward. I saw that lots of RDD operations support 2nd 
parameter of parallelism. This is the part confuse me. From my understand, the 
parallelism is totally controlled by how many cores you give to your job. 
Adjust that parameter, or spark.default.parallelism shouldn't have any impact.
For example, if I have a 10G data in HDFS, and assume the block size is 128M, 
so we get 100 blocks/partitions in RDD. Now if I transfer that RDD to a Pair 
RDD, with 1000 unique keys in the pair RDD, and doing reduceByKey action, using 
200 as the default parallelism. Here is what I assume:
1. We have 100 partitions, as the data comes from 100 blocks. Most likely Spark 
   will generate 100 tasks to read and shuffle them?
2. The 1000 unique keys mean 1000 reducer groups, like in MR.
3. If I set the max cores to 50, there will be up to 50 tasks running 
   concurrently. The rest of the tasks just have to wait for a core while 50 
   tasks are running.
4. Since we are doing reduceByKey, shuffling will happen. Data will be shuffled 
   into 1000 partitions, as we have 1000 unique keys.
5. I don't know how many tasks will process these 1000 partitions; maybe this is 
   where the parallelism parameter comes in?
6. No matter what the parallelism is, ONLY 50 tasks can run concurrently. So if 
   we set more cores, more partitions' data will be processed in the executor 
   (which runs more threads in this case), so more memory is needed. I don't see 
   how increasing parallelism could help the OOM in this case.
In my test case of Spark SQL, I gave 24G as the executor heap, and my join between 
2 big datasets keeps getting OOM. I kept increasing spark.default.parallelism, 
from 200 to 400, to 2000, even to 4000, with no help. What really makes the query 
finally finish without OOM is changing --total-executor-cores from 10 to 4.
So my questions are:
1) What does parallelism really mean in Spark? In the simple example above, for 
   reduceByKey, what difference does it make to change the parallelism from 10 to 20?
2) When we talk about partitions in Spark, for the data coming from HDFS I can 
   understand the partitioning clearly. For the intermediate data, the partition 
   will be the same as the key, right? For group, reduce and join actions, the 
   uniqueness of the keys determines the partitioning. Is that correct?
3) Why could increasing parallelism help with OOM? I don't get this part. 
   From my limited experience, adjusting the core count is what really matters for memory.
Thanks
Yong  

RE: Spark performance tuning

2015-02-21 Thread java8964
Can someone share some ideas about how to tune the GC time?
Thanks

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500




Hi, 
I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I 
setup a standalone box, with 24 cores and 64G memory.
We have one SQL in mind to test. Here is the basically setup on this one box 
for the SQL we are trying to run:
1) Dataset 1: a 6.6G Avro file with Snappy compression, which contains a nested 
   structure of 3 arrays of structs in Avro
2) Dataset 2: a 5G Avro file with Snappy compression
3) Dataset 3: a 2.3M Avro file with Snappy compression.
The basic structure of the query is like this:

(select xxx from dataset1
   lateral view outer explode(struct1)
   lateral view outer explode(struct2)
 where xxx)
left outer join
(select xxx from dataset2 lateral view explode(xxx) where xxx)
on xxx
left outer join
(select xxx from dataset3 where xxx)
on xxx
So overall what it does is 2 outer explode on dataset1, left outer join with 
explode of dataset2, then finally left outer join with dataset 3.
On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark 1.2.0.
Baseline, the above query can finish around 50 minutes in Hive 12, with 6 
mappers and 3 reducers, each with 1G max heap, in 3 rounds of MR jobs.
This is a very expensive query running in our production, of course with much 
bigger data set, every day. Now I want to see how fast Spark can do for the 
same query.
I am using the following settings, based on my understanding of Spark, for a 
fair test between it and Hive:
export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
--executor-memory 9g --total-executor-cores 9
I am trying to run the one executor with 9 cores and max 9G heap, to make Spark 
use almost same resource we gave to the MapReduce. Here is the result without 
any additional configuration changes, running under Spark 1.2.0, using 
HiveContext in Spark SQL, to run the exactly same query:
The Spark SQL query generated 5 stages of tasks, shown below:
4  collect at SparkPlan.scala:84        2015/02/20 10:48:46  26 s     200/200
3  mapPartitions at Exchange.scala:64   2015/02/20 10:32:07  16 min   200/200  1112.3 MB
2  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  9 min    40/40    4.7 GB   22.2 GB
1  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  1.9 min  50/50    6.2 GB   2.8 GB
0  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  6 s      2/2      2.3 MB   156.6 KB
So the wall time of whole query is 26s + 16m + 9m + 2m + 6s, around 28 minutes.
It is about 56% of originally time, not bad. But I want to know any tuning of 
Spark can make it even faster.
For stage 2 and 3, I observed that GC time is more and more expensive. 
Especially in stage 3, shown below:
For stage 3:
Metric         Min     25th percentile  Median  75th percentile  Max
Duration       20 s    30 s             35 s    39 s             2.4 min
GC Time        9 s     17 s             20 s    25 s             2.2 min
Shuffle Write  4.7 MB  4.9 MB           5.2 MB  6.1 MB           8.3 MB
So in median, the GC time took overall 20s/35s = 57% of time.
The first change I made was to add the following line in spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
My assumption is that using kryoSerializer, instead of default java serialize, 
will lower the memory footprint, should lower the GC pressure during runtime. I 
know the I changed the correct spark-default.conf, because if I were add 
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps in the same file, I will see the GC usage in the stdout 
file. Of course, in this test, I didn't add that, as I want to only make one 
change a time.The result is almost the same, as using standard java serialize. 
The wall time is still 28 minutes, and in stage 3, the GC still took around 50 
to 60% of time, almost same result within min, median to max in stage 3, 
without any noticeable performance gain.
Next, based on my understanding, and for this test, I think the default 
spark.storage.memoryFraction is too high for this query, as there is no reason 
to reserve so much memory for caching data, Because we don't reuse any dataset 
in this one query. So I add this at the end of spark-shell command --conf 
spark.storage.memoryFraction=0.3, as I want to just reserve half of the memory 
for caching data vs first time. Of course, this time, I rollback the first 
change of KryoSerializer.
The result looks like almost the same. The whole query finished around 28s + 
14m + 9.6m + 1.9m + 6s = 27 minutes.
It looks like that Spark is faster than Hive, but is there any steps I can make 
it even faster? Why using KryoSerializer makes no difference? If I want to 
use the same resource as now, anything I can do to speed it up 

Spark performance tuning

2015-02-20 Thread java8964
Hi, 
I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I 
setup a standalone box, with 24 cores and 64G memory.
We have one SQL in mind to test. Here is the basically setup on this one box 
for the SQL we are trying to run:
1) Dataset 1: a 6.6G Avro file with Snappy compression, which contains a nested 
   structure of 3 arrays of structs in Avro
2) Dataset 2: a 5G Avro file with Snappy compression
3) Dataset 3: a 2.3M Avro file with Snappy compression.
The basic structure of the query is like this:

(select xxx from dataset1
   lateral view outer explode(struct1)
   lateral view outer explode(struct2)
 where xxx)
left outer join
(select xxx from dataset2 lateral view explode(xxx) where xxx)
on xxx
left outer join
(select xxx from dataset3 where xxx)
on xxx
So overall what it does is 2 outer explode on dataset1, left outer join with 
explode of dataset2, then finally left outer join with dataset 3.
On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark 1.2.0.
Baseline, the above query can finish around 50 minutes in Hive 12, with 6 
mappers and 3 reducers, each with 1G max heap, in 3 rounds of MR jobs.
This is a very expensive query running in our production, of course with much 
bigger data set, every day. Now I want to see how fast Spark can do for the 
same query.
I am using the following settings, based on my understanding of Spark, for a 
fair test between it and Hive:
export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
--executor-memory 9g --total-executor-cores 9
I am trying to run the one executor with 9 cores and max 9G heap, to make Spark 
use almost same resource we gave to the MapReduce. Here is the result without 
any additional configuration changes, running under Spark 1.2.0, using 
HiveContext in Spark SQL, to run the exactly same query:
The Spark SQL query generated 5 stages of tasks, shown below:
4  collect at SparkPlan.scala:84        2015/02/20 10:48:46  26 s     200/200
3  mapPartitions at Exchange.scala:64   2015/02/20 10:32:07  16 min   200/200  1112.3 MB
2  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  9 min    40/40    4.7 GB   22.2 GB
1  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  1.9 min  50/50    6.2 GB   2.8 GB
0  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  6 s      2/2      2.3 MB   156.6 KB
So the wall time of whole query is 26s + 16m + 9m + 2m + 6s, around 28 minutes.
It is about 56% of originally time, not bad. But I want to know any tuning of 
Spark can make it even faster.
For stage 2 and 3, I observed that GC time is more and more expensive. 
Especially in stage 3, shown below:
For stage 3:
Metric         Min     25th percentile  Median  75th percentile  Max
Duration       20 s    30 s             35 s    39 s             2.4 min
GC Time        9 s     17 s             20 s    25 s             2.2 min
Shuffle Write  4.7 MB  4.9 MB           5.2 MB  6.1 MB           8.3 MB
So in median, the GC time took overall 20s/35s = 57% of time.
The first change I made was to add the following line in spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
My assumption is that using kryoSerializer, instead of default java serialize, 
will lower the memory footprint, should lower the GC pressure during runtime. I 
know that I changed the correct spark-defaults.conf, because if I add 
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps in the same file, I can see the GC activity in the stdout 
file. Of course, in this test, I didn't add that, as I want to only make one 
change at a time. The result is almost the same as using the standard Java serializer. 
The wall time is still 28 minutes, and in stage 3, the GC still took around 50 
to 60% of time, almost same result within min, median to max in stage 3, 
without any noticeable performance gain.
Next, based on my understanding, and for this test, I think the default 
spark.storage.memoryFraction is too high for this query, as there is no reason 
to reserve so much memory for caching data, Because we don't reuse any dataset 
in this one query. So I add this at the end of spark-shell command --conf 
spark.storage.memoryFraction=0.3, as I want to just reserve half of the memory 
for caching data vs first time. Of course, this time, I rollback the first 
change of KryoSerializer.
The result looks like almost the same. The whole query finished around 28s + 
14m + 9.6m + 1.9m + 6s = 27 minutes.
It looks like Spark is faster than Hive, but are there any steps I can take to make 
it even faster? Why does using KryoSerializer make no difference? If I want to 
use the same resources as now, is there anything I can do to speed it up more, 
especially to lower the GC time?
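For reference, a hedged starting point for the GC question (the flags are generic
HotSpot options, not a recommendation tuned to this query): print GC activity per
executor, and reserve less of the heap for the block-store cache since this query
does not reuse any cached dataset.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC")
    .set("spark.storage.memoryFraction", "0.2")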
Thanks
Yong
  

RangePartitioner in Spark 1.2.1

2015-02-17 Thread java8964
Hi, Sparkers:
I just happened to search in google for something related to the 
RangePartitioner of spark, and found an old thread in this email list as here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
I followed the code example mentioned in that email thread as following:
scala> import org.apache.spark.RangePartitioner
import org.apache.spark.RangePartitioner

scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog", "Elephant",
"fox", "gas", "horse", "index", "jet", "kitsch", "long", "moon", "Neptune",
"ooze", "Pen", "quiet", "rose", "sun", "talk", "umbrella", "voice", "Walrus",
"xeon", "Yam", "zebra"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at console:13

scala> rdd.keyBy(s => s(0).toUpper)
res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at console:16

scala> res0.partitionBy(new RangePartitioner[Char, String](26, res0)).values
res1: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at console:18

scala> res1.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx, s))).collect.foreach(println)
The above example is clear for me to understand the meaning of the 
RangePartitioner, but to my surprise, I got the following result:
(0,apple)
(0,Ball)
(1,cat)
(2,dog)
(3,Elephant)
(4,fox)
(5,gas)
(6,horse)
(7,index)
(8,jet)
(9,kitsch)
(10,long)
(11,moon)
(12,Neptune)
(13,ooze)
(14,Pen)
(15,quiet)
(16,rose)
(17,sun)
(18,talk)
(19,umbrella)
(20,voice)
(21,Walrus)
(22,xeon)
(23,Yam)
(24,zebra)
instead of the perfect range of indexes from 0 to 25 shown in the old email thread. 
Why is that? Is this a bug, or some new feature I don't understand?
BTW, the above environment I tested is in Spark 1.2.1 with Hadoop 2.4 binary 
release.
Thanks
Yong  

RE: spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
OK. I think I have to use None instead of null; then it works. I am still switching 
over from Java.
I can also just use the field name, as I assumed.
Great experience.
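A minimal sketch of the fix described above (file paths are placeholders): after a
leftOuterJoin the right side is an Option, so the filter checks for None rather
than null.

  case class OrigAsLeft(id: String)
  case class NewAsRight(id: String)

  val origData = sc.textFile("hdfs://firstfile").map(_.split(",")).map(r => (r(0), OrigAsLeft(r(0))))
  val newData  = sc.textFile("hdfs://secondfile").map(_.split(",")).map(r => (r(0), NewAsRight(r(0))))

  // IDs present in the first file but not in the second.
  val onlyInFirst = origData.leftOuterJoin(newData)
    .filter { case (_, (_, right)) => right.isEmpty }  // right == None
    .keys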

From: java8...@hotmail.com
To: user@spark.apache.org
Subject: spark left outer join with java.lang.UnsupportedOperationException: 
empty collection
Date: Thu, 12 Feb 2015 18:06:43 -0500




Hi, 
I am using Spark 1.2.0 with Hadoop 2.2. Now I have 2 csv files, each with 8 
fields. I know that the first field in both files is an ID. I want to find all 
the IDs that exist in the first file, but NOT in the 2nd file.
I am coming with the following code in spark-shell.
case class origAsLeft(id: String)
case class newAsRight(id: String)
val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map(r => (r(0), origAsLeft(r(0))))
val NewData = sc.textFile("hdfs://secondfile").map(_.split(",")).map(r => (r(0), newAsRight(r(0))))
val output = OrigData.leftOuterJoin(NewData).filter{ case (k, v) => v._2 == null }
From what I understand, after OrigData is left-outer-joined with NewData, it will 
use the id as the key, with a tuple of (leftObject, rightObject) as the value. Since 
I want to find all the IDs that exist in the first file, but not in the 2nd one, the 
output RDD should be the one I want, as it keeps a row only when there is 
no newAsRight object from NewData.
Then I run 
output.first
Spark does start to run, but gives me the following error message:
15/02/12 16:43:38 INFO scheduler.DAGScheduler: Job 4 finished: first at console:21, took 78.303549 s
java.lang.UnsupportedOperationException: empty collection   
 at org.apache.spark.rdd.RDD.first(RDD.scala:1095)   at 
$iwC$$iwC$$iwC$$iwC.init(console:21) at 
$iwC$$iwC$$iwC.init(console:26)  at $iwC$$iwC.init(console:28)   at 
$iwC.init(console:30)at init(console:32) at .init(console:36)   
 at .clinit(console) at .init(console:7) at .clinit(console) at 
$print(console)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:619) at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)   at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)   at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at 
org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)   at 
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)   at 
org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)  at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
   at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) 
 at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) 
 at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)   
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)  at 
org.apache.spark.repl.Main$.main(Main.scala:31)  at 
org.apache.spark.repl.Main.main(Main.scala)  at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:619) at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)   at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)  at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Did I do anything wrong? What is the right way to find all the IDs that are in the first file 
but not in the 2nd file?
The second question is how I can use an object's fields to do the comparison in this 
case. For example, if I define:
case class origAsLeft (id: String, name: String)
case class newAsRight (id: String, name: String)
val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map( r => (r(0), origAsLeft(r(0), r(1))))
val NewData = sc.textFile("hdfs://secondfile").map(_.split(",")).map( r => (r(0), newAsRight(r(0), r(1))))
In this case, I want to list all the data in the first file which has the same ID as in the 2nd file, but with a different value in name. I want to do something like below:
val output = OrigData.join(NewData).filter{ case (k, v) => v._1.name != v._2.name }
But what is the 

spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
Hi, 
I am using Spark 1.2.0 with Hadoop 2.2. I have 2 csv files, each with 8 fields. 
I know that the first field in both files is an ID. I want to find all 
the IDs that exist in the first file but NOT in the 2nd file.
I came up with the following code in spark-shell.
case class origAsLeft (id: String)
case class newAsRight (id: String)
val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map( r => (r(0), origAsLeft(r(0))))
val NewData = sc.textFile("hdfs://secondfile").map(_.split(",")).map( r => (r(0), newAsRight(r(0))))
val output = OrigData.leftOuterJoin(NewData).filter{ case (k, v) => v._2 == null }
From what I understand, after OrigData is left outer joined with NewData, it will 
use the id as the key, and the value will be a tuple of (leftObject, rightObject). Since I want 
to find all the IDs that exist in the first file but not in the 2nd one, 
the output RDD should be the one I want, as the filter keeps only the rows where there is 
no newAsRight object from NewData.
Then I run 
output.first
Spark does start to run, but gives me the following error message:
15/02/12 16:43:38 INFO scheduler.DAGScheduler: Job 4 finished: first at <console>:21, took 78.303549 s
java.lang.UnsupportedOperationException: empty collection
 at org.apache.spark.rdd.RDD.first(RDD.scala:1095)   at 
$iwC$$iwC$$iwC$$iwC.init(console:21) at 
$iwC$$iwC$$iwC.init(console:26)  at $iwC$$iwC.init(console:28)   at 
$iwC.init(console:30)at init(console:32) at .init(console:36)   
 at .clinit(console) at .init(console:7) at .clinit(console) at 
$print(console)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:619) at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)   at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)   at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at 
org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)   at 
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)   at 
org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)  at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
   at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) 
 at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916) 
 at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)   
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)  at 
org.apache.spark.repl.Main$.main(Main.scala:31)  at 
org.apache.spark.repl.Main.main(Main.scala)  at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:619) at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)   at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)  at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Did I do anything wrong? What is the right way to find all the IDs that are in the first file 
but not in the 2nd file?
The second question is how I can use an object's fields to do the comparison in this 
case. For example, if I define:
case class origAsLeft (id: String, name: String)
case class newAsRight (id: String, name: String)
val OrigData = sc.textFile("hdfs://firstfile").map(_.split(",")).map( r => (r(0), origAsLeft(r(0), r(1))))
val NewData = sc.textFile("hdfs://secondfile").map(_.split(",")).map( r => (r(0), newAsRight(r(0), r(1))))
In this case, I want to list all the data in the first file which has the same ID as in the 2nd file, but with a different value in name. I want to do something like below:
val output = OrigData.join(NewData).filter{ case (k, v) => v._1.name != v._2.name }
But what is the syntax to use the field in the case class I defined?
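A sketch of what I am trying to write, assuming the joined value is simply the tuple of the two case class instances (which is how join pairs them up), so the fields are reachable by name:

val diffName = OrigData.join(NewData)
  .filter { case (_, (left, right)) => left.name != right.name }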
Thanks
Yong  

Spark concurrency question

2015-02-08 Thread java8964
Hi, I have some questions about how Spark runs jobs concurrently.
For example, suppose I set up Spark on one standalone test box, which has 24 cores 
and 64G of memory. I set the worker memory to 48G and the executor memory to 4G, 
and use spark-shell to run some jobs. Here is what confuses me:
1) Does the above setting mean that I can have up to 12 executors running on this box at the same time?
2) Let's assume that I want to do a line count of one 1280M HDFS file, which has 10 blocks at 128M per block. In this case, when the Spark program starts to run, will it kick off one executor using 10 threads to read these 10 blocks of HDFS data, or 10 executors reading one block each? Or something else? I have read the Apache Spark documentation, so I know that this 1280M HDFS file will be split into 10 partitions. But how the executors run them is not clear to me.
3) In my test case, I started one spark-shell to run a very expensive job. I saw in the Spark web UI that 8 stages were generated, with 200 to 400 tasks in each stage, and the tasks started to run. At this point, I started another spark-shell to connect to the master and tried to run a small Spark program. From that spark-shell, it shows my new small program is waiting for resources. Why? And what kind of resources is it waiting for? If it is waiting for memory, does this mean that there are 12 concurrent tasks running in the first program, taking 12 * 4G = 48G of the memory given to the worker, so no more resources are available? If so, in this case, is one running task one executor?
4) In MapReduce, the counts of map and reduce tasks are the resources used by the cluster. My understanding is that Spark uses multiple threads instead of individual JVM processes. In this case, does the executor use its 4G heap to run multiple threads? My real question is whether each executor corresponds to one RDD partition, or whether an executor can spawn a thread per RDD partition. On the other hand, how does the worker decide how many executors to create?
If there is any online document answering the above questions, please let me 
know. I searched in the Apache Spark site, but couldn't find it.
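In the meantime, a small sketch of what I run to at least see how the work is split up before worrying about executors (the path is just a placeholder):

val file = sc.textFile("hdfs:///path/to/1280MB-file")  // placeholder path
println(file.partitions.length)  // roughly one partition per HDFS block, ~10 here
println(sc.defaultParallelism)   // task count used when no explicit partition count is given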
Thanks
Yong  

My first experience with Spark

2015-02-05 Thread java8964
I am evaluating Spark for our production usage. Our production cluster is 
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment 
running with Hadoop.
What I have in mind is to test a very complex Hive query, which joins between 6 
tables, lots of nested structure with exploding, and currently takes 8 hours 
daily running in our production.
All the data of this query are in AVRO + Snappy.
I setup one Box (24 core + 64G memory), installed the same version of Hadoop as 
our production, and put 5% of data on it (which is about 60G, snappy compressed 
AVRO files)
I am running the same query in Hive. It took 6 rounds of MR jobs, finished 
around 30 hours on this one box.
Now, I start to have fun with Spark.
I checked out Spark 1.2.0, built it following Spark build instructions, and 
installed on this one box.
Since the test data is all in AVRO format, so I also built the latest 
development version of SparkAvro, from https://github.com/databricks/spark-avro
1) First, I got some problems using the AVRO data in spark-avro. It turns out 
that the Spark 1.2.0 build process will merge mismatched versions of the AVRO 
core and AVRO mapred jars. I manually fixed it. See the issue here: 
https://github.com/databricks/spark-avro/issues/24
2) After that, I am impressed because:
- The AVRO files just work from HDFS in Spark 1.2.
- The complex query (about 200 lines) just starts to run in Spark 1.2 using org.apache.spark.sql.hive.HiveContext without any problem. This HiveContext just works in Spark SQL 1.2. Very nice.
3) I got several OOMs, which is reasonable. I finally changed the memory settings to:
export SPARK_WORKER_MEMORY=8g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=8g
As 4g just doesn't work for the test data volume. After I set it to 8G, the job won't fail due to OOM.
4) It looks like Spark generates 8 stages for the big query. It finishes 
stage 1 and stage 2, then failed on stage 3 twice with the following error:







FetchFailed(null, shuffleId=7, mapId=-1, reduceId=7, 
message=org.apache.spark.shuffle.MetadataFetchFailedException: Missing an 
output location for shuffle 7at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)  at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)  at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
 at 
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178) 
 at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
 at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)   at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)   at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)   at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)   at 
org.apache.spark.scheduler.Task.run(Task.scala:56)   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) 
 at java.lang.Thread.run(Thread.java:853)
)
During the whole test, the CPU load average is about 16, and still 

RE: My first experience with Spark

2015-02-05 Thread java8964
Finally I gave up after there were too many failed retries.
From the log on the worker side, it looks like it failed with a JVM OOM, as below:
15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception 
in thread Thread[Driver Heartbeater,5,main]java.lang.OutOfMemoryError: Java 
heap spaceat java.lang.StringBuilder.toString(StringBuilder.java:812)   
 at 
scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:427)
at scala.concurrent.duration.FiniteDuration.unitString(Duration.scala:583)  
  at scala.concurrent.duration.FiniteDuration.toString(Duration.scala:584)  
  at java.lang.String.valueOf(String.java:1675)at 
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)   
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)at 
org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:398)15/02/05 
17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread 
Thread[org.apache.hadoop.hdfs.PeerCache@43fe286e,5,main]java.lang.OutOfMemoryError:
 Java heap spaceat 
org.spark-project.guava.common.collect.LinkedListMultimap$5.listIterator(LinkedListMultimap.java:912)
at java.util.AbstractList.listIterator(AbstractList.java:310)at 
java.util.AbstractSequentialList.iterator(AbstractSequentialList.java:250)  
  at org.apache.hadoop.hdfs.PeerCache.evictExpired(PeerCache.java:213)
at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:255)at 
org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:39)at 
org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:135)at 
java.lang.Thread.run(Thread.java:853)15/02/05 17:02:03 ERROR executor.Executor: 
Exception in task 5.0 in stage 3.2 (TID 2618)
Is this due to OOM in the shuffle stage? I already set SPARK_WORKER_MEMORY=8g, 
and I can see from the web UI that it is 8g. Is there any 
configuration I can change to avoid the above OOM?
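For reference, these are the Spark 1.2-era knobs I am considering; the values below are guesses to experiment with, not recommendations:

val conf = new org.apache.spark.SparkConf()
  .set("spark.default.parallelism", "400")     // more, smaller shuffle tasks
  .set("spark.shuffle.memoryFraction", "0.4")  // heap fraction for shuffle aggregation (default 0.2)
  .set("spark.storage.memoryFraction", "0.4")  // shrink the cache fraction to make room (default 0.6)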
Thanks
Yong
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: My first experience with Spark
Date: Thu, 5 Feb 2015 16:03:33 -0500




I am evaluating Spark for our production usage. Our production cluster is 
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment 
running with Hadoop.
What I have in mind is to test a very complex Hive query, which joins between 6 
tables, lots of nested structure with exploding, and currently takes 8 hours 
daily running in our production.
All the data of this query are in AVRO + Snappy.
I setup one Box (24 core + 64G memory), installed the same version of Hadoop as 
our production, and put 5% of data on it (which is about 60G, snappy compressed 
AVRO files)
I am running the same query in Hive. It took 6 rounds of MR jobs, finished 
around 30 hours on this one box.
Now, I start to have fun with Spark.
I checked out Spark 1.2.0, built it following Spark build instructions, and 
installed on this one box.
Since the test data is all in AVRO format, so I also built the latest 
development version of SparkAvro, from https://github.com/databricks/spark-avro
1) First, I got some problems using the AVRO data in spark-avro. It turns out 
that the Spark 1.2.0 build process will merge mismatched versions of the AVRO 
core and AVRO mapred jars. I manually fixed it. See the issue here: 
https://github.com/databricks/spark-avro/issues/24
2) After that, I am impressed because:
- The AVRO files just work from HDFS in Spark 1.2.
- The complex query (about 200 lines) just starts to run in Spark 1.2 using org.apache.spark.sql.hive.HiveContext without any problem. This HiveContext just works in Spark SQL 1.2. Very nice.
3) I got several OOMs, which is reasonable. I finally changed the memory settings to:
export SPARK_WORKER_MEMORY=8g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=8g
As 4g just doesn't work for the test data volume. After I set it to 8G, the job won't fail due to OOM.
4) It looks like Spark generates 8 stages for the big query. It finishes 
stage 1 and stage 2, then failed on stage 3 twice with the following error:







FetchFailed(null, shuffleId=7, mapId=-1, reduceId=7, 
message=org.apache.spark.shuffle.MetadataFetchFailedException: Missing an 
output location for shuffle 7at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
 at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
 at 

RE: My first experience with Spark

2015-02-05 Thread java8964
Hi, Deb:
From what I have searched online, changing the parallelism is one option. But the 
failed stage already had 200 tasks, which is quite a lot on a single 24-core box.
I know that querying this amount of data on one box is pushing it, but I do want to 
know how to configure Spark to use less memory, even if that means taking more time. We 
plan to have Spark coexist with our Hadoop cluster, so being able to control its 
memory usage is important for us.
Does Spark need that much memory?
Thanks
Yong
Date: Thu, 5 Feb 2015 15:36:48 -0800
Subject: Re: My first experience with Spark
From: deborah.sie...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org

Hi Yong, 
Have you tried increasing your level of parallelism? How many tasks are you 
getting in the failing stage? 2-3 tasks per CPU core is recommended, though maybe 
you need more for your shuffle operation?

You can configure spark.default.parallelism, or pass a level of parallelism 
as the second parameter to a suitable operation in your code. 
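A small sketch of that second option, just to make it concrete (the input path and key extraction are placeholders; 72 is roughly 3 tasks per core on a 24-core box):

val pairs = sc.textFile("hdfs:///some/input").map(line => (line.split(",")(0), 1))  // placeholder key extraction
val counts = pairs.reduceByKey(_ + _, 72)  // numPartitions passed directly to the shuffle operation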

Deb
On Thu, Feb 5, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote:



I am evaluating Spark for our production usage. Our production cluster is 
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment 
running with Hadoop.
What I have in mind is to test a very complex Hive query, which joins between 6 
tables, lots of nested structure with exploding, and currently takes 8 hours 
daily running in our production.
All the data of this query are in AVRO + Snappy.
I setup one Box (24 core + 64G memory), installed the same version of Hadoop as 
our production, and put 5% of data on it (which is about 60G, snappy compressed 
AVRO files)
I am running the same query in Hive. It took 6 rounds of MR jobs, finished 
around 30 hours on this one box.
Now, I start to have fun with Spark.
I checked out Spark 1.2.0, built it following Spark build instructions, and 
installed on this one box.
Since the test data is all in AVRO format, so I also built the latest 
development version of SparkAvro, from https://github.com/databricks/spark-avro
1) First, I got some problems using the AVRO data in spark-avro. It turns out 
that the Spark 1.2.0 build process will merge mismatched versions of the AVRO 
core and AVRO mapred jars. I manually fixed it. See the issue here: 
https://github.com/databricks/spark-avro/issues/24
2) After that, I am impressed because:
- The AVRO files just work from HDFS in Spark 1.2.
- The complex query (about 200 lines) just starts to run in Spark 1.2 using org.apache.spark.sql.hive.HiveContext without any problem. This HiveContext just works in Spark SQL 1.2. Very nice.
3) I got several OOMs, which is reasonable. I finally changed the memory settings to:
export SPARK_WORKER_MEMORY=8g
export SPARK_DRIVER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=8g
As 4g just doesn't work for the test data volume. After I set it to 8G, the job won't fail due to OOM.
4) It looks like Spark generates 8 stages for the big query. It finishes 
stage 1 and stage 2, then failed on stage 3 twice with the following error:







FetchFailed(null, shuffleId=7, mapId=-1, reduceId=7, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 7
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at 
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88

Problem to run spark as standalone

2014-10-27 Thread java8964
Hi, Spark Users:
I tried to test Spark on a standalone box, but faced an issue for which I don't 
know the root cause. I basically followed exactly the documentation for deploying 
Spark in a standalone environment.
1) I checked out the Spark source code of release 1.1.0.
2) I built Spark with the following command: ./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests. Succeeded.
3) I made sure that I can ssh to localhost as myself using an ssh key.
4) I ran sbin/start-all.sh; it looks fine, at least I saw 2 java processes running.
5) When I ran the following command:
yzhang@yzhang-linux:/opt/spark-1.1.0-bin-hadoop2.4.0/bin$ ./spark-shell --master spark://yzhang-linux:7077
I saw the following message, and then the shell exited by itself.
14/10/27 11:22:53 INFO repl.SparkILoop: Created spark context..Spark context 
available as sc.
scala 14/10/27 11:23:13 INFO client.AppClient$ClientActor: Connecting to 
master spark://yzhang-linux:7077...14/10/27 11:23:33 INFO 
client.AppClient$ClientActor: Connecting to master 
spark://yzhang-linux:7077...14/10/27 11:23:53 ERROR 
cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All 
masters are unresponsive! Giving up.14/10/27 11:23:53 ERROR 
scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All 
masters are unresponsive! Giving up.
Now, I checked the log file, and found the following message in the master 
log:
 14/10/27 11:22:53 ERROR remote.EndpointWriter: dropping message [class 
akka.actor.SelectChildName] for non-local recipient 
[Actor[akka.tcp://sparkMaster@yzhang-linux:7077/]] arriving at 
[akka.tcp://sparkMaster@yzhang-linux:7077] inbound addresses are 
[akka.tcp://sparkMaster@yzhang-linux:7077]14/10/27 11:23:13 ERROR 
remote.EndpointWriter: dropping message [class akka.actor.SelectChildName] for 
non-local recipient [Actor[akka.tcp://sparkMaster@yzhang-linux:7077/]] arriving 
at [akka.tcp://sparkMaster@yzhang-linux:7077] inbound addresses are 
[akka.tcp://sparkMaster@yzhang-linux:7077]14/10/27 11:23:33 ERROR 
remote.EndpointWriter: dropping message [class akka.actor.SelectChildName] for 
non-local recipient [Actor[akka.tcp://sparkMaster@yzhang-linux:7077/]] arriving 
at [akka.tcp://sparkMaster@yzhang-linux:7077] inbound addresses are 
[akka.tcp://sparkMaster@yzhang-linux:7077]14/10/27 11:23:53 INFO master.Master: 
akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, removing 
it.14/10/27 11:23:53 INFO master.Master: 
akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, removing 
it.14/10/27 11:23:53 INFO actor.LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40192.168.240.8%3A63348-2#1992401281]
 was not delivered. [1] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.14/10/27 11:23:53 ERROR 
remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@yzhang-linux:7077] - 
[akka.tcp://sparkDriver@yzhang-linux:44017]: Error [Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]] 
[akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: yzhang-linux/192.168.240.8:44017]14/10/27 11:23:53 INFO 
master.Master: akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, 
removing it.14/10/27 11:23:53 INFO master.Master: 
akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, removing 
it.14/10/27 11:23:53 ERROR remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@yzhang-linux:7077] - 
[akka.tcp://sparkDriver@yzhang-linux:44017]: Error [Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]] 
[akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: yzhang-linux/192.168.240.8:44017]14/10/27 11:23:53 ERROR 
remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@yzhang-linux:7077] - 
[akka.tcp://sparkDriver@yzhang-linux:44017]: Error [Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]] 
[akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: yzhang-linux/192.168.240.8:44017]14/10/27 11:23:53 INFO 
master.Master: akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, 
removing it.
Any reason why this is happening? The web UI of spark looks normal. There is no 
error message in the worker log. This is a standalone box, no 

RE: Problem to run spark as standalone

2014-10-27 Thread java8964
I did a little more research on this. It looks like the worker started 
successfully, but on port 40294. This is shown in both the log and the master web UI.
The question is that, in the log, the master's akka.tcp is trying to connect to 
a different port (44017). Why?
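My guess (not a confirmed fix): 44017 is the driver's own ephemeral port (the log shows akka.tcp://sparkDriver@yzhang-linux:44017), so the master is calling back into the driver rather than the worker. One thing I plan to try is pinning the driver's host and port explicitly; a sketch, with the port value being an arbitrary placeholder:

val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.host", "yzhang-linux")  // address the master should use to reach the driver
  .set("spark.driver.port", "51000")         // fix the driver's listening port instead of a random one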
Yong

From: java8...@hotmail.com
To: u...@spark.incubator.apache.org
Subject: Problem to run spark as standalone
Date: Mon, 27 Oct 2014 11:38:32 -0400




Hi, Spark Users:
I tried to test Spark on a standalone box, but faced an issue for which I don't 
know the root cause. I basically followed exactly the documentation for deploying 
Spark in a standalone environment.
1) I checked out the Spark source code of release 1.1.0.
2) I built Spark with the following command: ./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests. Succeeded.
3) I made sure that I can ssh to localhost as myself using an ssh key.
4) I ran sbin/start-all.sh; it looks fine, at least I saw 2 java processes running.
5) When I ran the following command:
yzhang@yzhang-linux:/opt/spark-1.1.0-bin-hadoop2.4.0/bin$ ./spark-shell --master spark://yzhang-linux:7077
I saw the following message, and then the shell exited by itself.
14/10/27 11:22:53 INFO repl.SparkILoop: Created spark context..Spark context 
available as sc.
scala 14/10/27 11:23:13 INFO client.AppClient$ClientActor: Connecting to 
master spark://yzhang-linux:7077...14/10/27 11:23:33 INFO 
client.AppClient$ClientActor: Connecting to master 
spark://yzhang-linux:7077...14/10/27 11:23:53 ERROR 
cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All 
masters are unresponsive! Giving up.14/10/27 11:23:53 ERROR 
scheduler.TaskSchedulerImpl: Exiting due to error from cluster scheduler: All 
masters are unresponsive! Giving up.
Now, I checked the log file, and found the following message in the master 
log:
 14/10/27 11:22:53 ERROR remote.EndpointWriter: dropping message [class 
akka.actor.SelectChildName] for non-local recipient 
[Actor[akka.tcp://sparkMaster@yzhang-linux:7077/]] arriving at 
[akka.tcp://sparkMaster@yzhang-linux:7077] inbound addresses are 
[akka.tcp://sparkMaster@yzhang-linux:7077]14/10/27 11:23:13 ERROR 
remote.EndpointWriter: dropping message [class akka.actor.SelectChildName] for 
non-local recipient [Actor[akka.tcp://sparkMaster@yzhang-linux:7077/]] arriving 
at [akka.tcp://sparkMaster@yzhang-linux:7077] inbound addresses are 
[akka.tcp://sparkMaster@yzhang-linux:7077]14/10/27 11:23:33 ERROR 
remote.EndpointWriter: dropping message [class akka.actor.SelectChildName] for 
non-local recipient [Actor[akka.tcp://sparkMaster@yzhang-linux:7077/]] arriving 
at [akka.tcp://sparkMaster@yzhang-linux:7077] inbound addresses are 
[akka.tcp://sparkMaster@yzhang-linux:7077]14/10/27 11:23:53 INFO master.Master: 
akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, removing 
it.14/10/27 11:23:53 INFO master.Master: 
akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, removing 
it.14/10/27 11:23:53 INFO actor.LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40192.168.240.8%3A63348-2#1992401281]
 was not delivered. [1] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.14/10/27 11:23:53 ERROR 
remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@yzhang-linux:7077] - 
[akka.tcp://sparkDriver@yzhang-linux:44017]: Error [Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]] 
[akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: yzhang-linux/192.168.240.8:44017]14/10/27 11:23:53 INFO 
master.Master: akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, 
removing it.14/10/27 11:23:53 INFO master.Master: 
akka.tcp://sparkDriver@yzhang-linux:44017 got disassociated, removing 
it.14/10/27 11:23:53 ERROR remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@yzhang-linux:7077] - 
[akka.tcp://sparkDriver@yzhang-linux:44017]: Error [Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]] 
[akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: yzhang-linux/192.168.240.8:44017]14/10/27 11:23:53 ERROR 
remote.EndpointWriter: AssociationError 
[akka.tcp://sparkMaster@yzhang-linux:7077] - 
[akka.tcp://sparkDriver@yzhang-linux:44017]: Error [Association failed with 
[akka.tcp://sparkDriver@yzhang-linux:44017]] 
[akka.remote.EndpointAssociationException: Association failed with