How to separate messages of different topics.

2015-05-05 Thread Guillermo Ortiz
I want to read from many topics in Kafka and know which topic each message
is coming from (topic1, topic2 and so on).

 val kafkaParams = Map[String, String]("metadata.broker.list" -> "myKafka:9092")
 val topics = Set("EntryLog", "presOpManager")
 val directKafkaStream = KafkaUtils.createDirectStream[String, String,
 StringDecoder, StringDecoder](ssc, kafkaParams, topics)

 Is there some way to separate the messages by topic with just one
directStream, or should I create a different stream for each topic?
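A single direct stream can cover this: in Spark 1.3 each RDD produced by
createDirectStream implements HasOffsetRanges, and RDD partition i corresponds to
offsetRanges(i), so the topic can be recovered per partition. A minimal sketch
(topic names as in the question, not tested against your setup):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val directKafkaStream = KafkaUtils.createDirectStream[String, String,
  StringDecoder, StringDecoder](ssc, kafkaParams, Set("EntryLog", "presOpManager"))

// Tag every record with the topic of the Kafka partition it came from.
// The cast must happen on the stream's own RDDs, i.e. in the first operation applied to it.
val taggedByTopic = directKafkaStream.transform { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = offsetRanges(i).topic
    iter.map { case (_, value) => (topic, value) }
  }
}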


Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
And how important is it to have a production environment?
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:

 There are questions in all three languages.

 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:

 I too have similar question.

 My understanding is that since Spark is written in Scala, having done it in
 Scala will be OK for certification.

 If someone who has done certification can confirm.

 Thanks,

 Kartik
 On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
 wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only Python
 and Scala not work?


 Regards,
 Gourav





Spark + Kakfa with directStream

2015-05-05 Thread Guillermo Ortiz
I'm trying to execute the Hello World example with Spark + Kafka (
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala)
with createDirectStream and I get this error.


java.lang.NoSuchMethodError:
kafka.message.MessageAndMetadata.&lt;init&gt;(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
at
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:176)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

I have checked the jars and it's the kafka-2.10-0.8.2.jar in the classpath
with the MessageAndMetadata class. I even navigated to this class through
Eclipse and it has the right parameters.

My pom.xml has these dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.3.1</version>
  <!-- <scope>provided</scope> -->
  <exclusions>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.10</artifactId>
  <version>0.8.2.1</version>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.3.1</version>
  <!-- <scope>provided</scope> -->
  <exclusions>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
    </exclusion>
  </exclusions>
</dependency>
</dependencies>

Does somebody know where the error is? Thanks.


RE: Re: sparksql running slow while joining 2 tables.

2015-05-05 Thread Cheng, Hao
56 MB / 26 MB are very small sizes. Do you observe data skew? More precisely, many
records with the same chrname / name? And can you also double-check the JVM
settings for the executor process?


From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: sparksql running slow while joining 2 tables.


Hi guys,

  attached the pics of the physical plan and the logs. Thanks.



Thanks & Best regards!
罗辉 San.Luo

----- Original Message -----
From: Cheng, Hao hao.ch...@intel.com
To: Wang, Daoyuan daoyuan.w...@intel.com, luohui20...@sina.com,
Olivier Girardot ssab...@gmail.com, user user@spark.apache.org
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 13:18


I assume you’re using the DataFrame API within your application.



sql(“SELECT…”).explain(true)



From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier
Girardot; user
Subject: RE: Re: RE: Re: sparksql running slow while joining 2 tables.



You can use

Explain extended select ….



From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: sparksql running slow while joining 2 tables.



As far as I know, broadcast join is automatically enabled by
spark.sql.autoBroadcastJoinThreshold.

refer to
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options



And how do I check my app's physical plan, and other things like the optimized
plan, executable plan, etc.?



thanks







Thanks & Best regards!
罗辉 San.Luo



----- Original Message -----
From: Cheng, Hao hao.ch...@intel.com
To: Cheng, Hao hao.ch...@intel.com, luohui20...@sina.com,
Olivier Girardot ssab...@gmail.com, user user@spark.apache.org
Subject: RE: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-05 08:38



Or, have you ever tried a broadcast join?



From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: sparksql running slow while joining 2 tables.



Can you print out the physical plan?



EXPLAIN SELECT xxx…



From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: sparksql running slow while joining 2 tables.



hi Olivier

Spark 1.3.1, with Java 1.8.0_45.

I attached 2 pics.

It seems like a GC issue. I also tried different parameters like the memory
size of the driver/executor, memory fraction, Java opts...

but this issue still happens.







Thanks & Best regards!
罗辉 San.Luo



----- Original Message -----
From: Olivier Girardot ssab...@gmail.com
To: luohui20...@sina.com, user user@spark.apache.org
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-04 20:46



Hi,
What is your Spark version?



Regards,



Olivier.



On Mon, 4 May 2015 at 11:03, luohui20...@sina.com
wrote:

hi guys

when I am running a SQL query like "select a.name, a.startpoint, a.endpoint,
a.piece from db a join sample b on (a.name = b.name) where (b.startpoint >
a.startpoint + 25)", I found SparkSQL running slowly, taking minutes, which may
be caused by very long GC and shuffle time.



   Table db is created from a txt file sized at 56 MB while table sample is
sized at 26 MB; both are small.

   My Spark cluster is a standalone pseudo-distributed Spark cluster with an
8 GB executor and a 4 GB driver.

   Any advice? Thank you guys.









Thanks & Best regards!
罗辉 San.Luo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
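For the two recurring questions in this thread (how to inspect the plans, and how
to get a broadcast join for such small tables), here is a minimal Spark 1.3 sketch.
It assumes the two tables are already registered with a SQLContext/HiveContext named
sqlContext; the 64 MB threshold is only an example, and the broadcast decision needs
Spark to know the smaller table's size statistics (for Hive tables, e.g. after
ANALYZE TABLE sample COMPUTE STATISTICS noscan):

// Allow tables up to ~64 MB to be broadcast, so the 26 MB "sample" table qualifies.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (64 * 1024 * 1024).toString)

val joined = sqlContext.sql(
  """SELECT a.name, a.startpoint, a.endpoint, a.piece
    |FROM db a JOIN sample b ON a.name = b.name
    |WHERE b.startpoint > a.startpoint + 25""".stripMargin)

// Prints the parsed, analyzed, optimized and physical plans for inspection.
joined.explain(true)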


multiple hdfs folder files input to PySpark

2015-05-05 Thread Oleg Ruchovets
Hi
   We are using PySpark 1.3 and the input is text files located on HDFS.

file structure
day1
file1.txt
file2.txt
day2
file1.txt
file2.txt
 ...

Question:

   1) What is the way to provide multiple files located in multiple folders
(on HDFS) as input for a PySpark job?
Using the textFile method works fine for a single file or folder, but how can I
do it with multiple folders?
Is there a way to pass an array or list of files?

   2) What is the meaning of the partition parameter in the textFile method?

  sc = SparkContext(appName="TAD")
  lines = sc.textFile("my input", 1)

Thanks
Oleg.
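For what it's worth, textFile accepts a comma-separated list of paths as well as
glob patterns, and its second argument is the minimum number of partitions to split
the input into. A small sketch (shown in Scala; the same call shape works from
PySpark, and the paths are only illustrative):

// Several folders in one RDD: comma-separated paths and globs are both accepted.
val twoDays = sc.textFile("hdfs:///data/day1,hdfs:///data/day2", 4)

// Or match every day folder at once with a glob pattern.
val allDays = sc.textFile("hdfs:///data/day*/*.txt")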


Re: JAVA for SPARK certification

2015-05-05 Thread Kartik Mehta
Production - not a whole lot of companies have implemented Spark in
production, so though it is good to have, it is not a must.

If you are on LinkedIn, a group of folks including myself are preparing for
Spark certification, learning in group makes learning easy and fun.

Kartik
On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote:

 And how important is to have production environment?
 On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:

 There are questions in all three languages.

 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:

 I too have similar question.

 My understanding is since Spark written in scala, having done in Scala
 will be ok for certification.

 If someone who has done certification can confirm.

 Thanks,

 Kartik
 On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
 wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only
 Python and Scala not work?


 Regards,
 Gourav





Re: Unable to join table across data sources using sparkSQL

2015-05-05 Thread ayan guha
I suggest you try date_dim.d_year in the query.

On Tue, May 5, 2015 at 10:47 PM, Ishwardeep Singh 
ishwardeep.si...@impetus.co.in wrote:

  Hi Ankit,



 printSchema() works fine for all the tables.



 hiveStoreSalesDF.printSchema()

 root

 |-- store_sales.ss_sold_date_sk: integer (nullable = true)

 |-- store_sales.ss_sold_time_sk: integer (nullable = true)

 |-- store_sales.ss_item_sk: integer (nullable = true)

 |-- store_sales.ss_customer_sk: integer (nullable = true)

 |-- store_sales.ss_cdemo_sk: integer (nullable = true)

 |-- store_sales.ss_hdemo_sk: integer (nullable = true)

 |-- store_sales.ss_addr_sk: integer (nullable = true)

 |-- store_sales.ss_store_sk: integer (nullable = true)

 |-- store_sales.ss_promo_sk: integer (nullable = true)

 |-- store_sales.ss_ticket_number: integer (nullable = true)

 |-- store_sales.ss_quantity: integer (nullable = true)

 |-- store_sales.ss_wholesale_cost: double (nullable = true)

 |-- store_sales.ss_list_price: double (nullable = true)

 |-- store_sales.ss_sales_price: double (nullable = true)

 |-- store_sales.ss_ext_discount_amt: double (nullable = true)

 |-- store_sales.ss_ext_sales_price: double (nullable = true)

 |-- store_sales.ss_ext_wholesale_cost: double (nullable = true)

 |-- store_sales.ss_ext_list_price: double (nullable = true)

 |-- store_sales.ss_ext_tax: double (nullable = true)

 |-- store_sales.ss_coupon_amt: double (nullable = true)

 |-- store_sales.ss_net_paid: double (nullable = true)

 |-- store_sales.ss_net_paid_inc_tax: double (nullable = true)

 |-- store_sales.ss_net_profit: double (nullable = true)



 dateDimDF.printSchema()

 root

 |-- d_date_sk: integer (nullable = false)

 |-- d_date_id: string (nullable = false)

 |-- d_date: date (nullable = true)

 |-- d_month_seq: integer (nullable = true)

 |-- d_week_seq: integer (nullable = true)

 |-- d_quarter_seq: integer (nullable = true)

 |-- d_year: integer (nullable = true)

 |-- d_dow: integer (nullable = true)

 |-- d_moy: integer (nullable = true)

 |-- d_dom: integer (nullable = true)

 |-- d_qoy: integer (nullable = true)

 |-- d_fy_year: integer (nullable = true)

 |-- d_fy_quarter_seq: integer (nullable = true)

 |-- d_fy_week_seq: integer (nullable = true)

 |-- d_day_name: string (nullable = true)

 |-- d_quarter_name: string (nullable = true)

 |-- d_holiday: string (nullable = true)

 |-- d_weekend: string (nullable = true)

 |-- d_following_holiday: string (nullable = true)

 |-- d_first_dom: integer (nullable = true)

 |-- d_last_dom: integer (nullable = true)

 |-- d_same_day_ly: integer (nullable = true)

 |-- d_same_day_lq: integer (nullable = true)

 |-- d_current_day: string (nullable = true)

 |-- d_current_week: string (nullable = true)

 |-- d_current_month: string (nullable = true)

 |-- d_current_quarter: string (nullable = true)

 |-- d_current_year: string (nullable = true)



 itemDF.printSchema()

 root

 |-- i_item_sk: integer (nullable = false)

 |-- i_item_id: string (nullable = false)

 |-- i_rec_start_date: date (nullable = true)

 |-- i_rec_end_date: date (nullable = true)

 |-- i_item_desc: string (nullable = true)

 |-- i_current_price: decimal (nullable = true)

 |-- i_wholesale_cost: decimal (nullable = true)

 |-- i_brand_id: integer (nullable = true)

 |-- i_brand: string (nullable = true)

 |-- i_class_id: integer (nullable = true)

 |-- i_class: string (nullable = true)

 |-- i_category_id: integer (nullable = true)

 |-- i_category: string (nullable = true)

 |-- i_manufact_id: integer (nullable = true)

 |-- i_manufact: string (nullable = true)

 |-- i_size: string (nullable = true)

 |-- i_formulation: string (nullable = true)

 |-- i_color: string (nullable = true)

 |-- i_units: string (nullable = true)

 |-- i_container: string (nullable = true)

 |-- i_manager_id: integer (nullable = true)

 |-- i_product_name: string (nullable = true)



 Regards,

 Ishwardeep



 *From:* ankitjindal [via Apache Spark User List] [mailto:ml-node+[hidden email]]
 *Sent:* Tuesday, May 5, 2015 5:00 PM
 *To:* Ishwardeep Singh
 *Subject:* RE: Unable to join table across data sources using sparkSQL



 Just check the Schema of both the tables using frame.printSchema();

 

Re: Spark + Kakfa with directStream

2015-05-05 Thread Guillermo Ortiz
Sorry, I had a duplicated kafka dependency with another older version in
another pom.xml

2015-05-05 14:46 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com:

 I'm trying to execute the Hello World example with Spark + Kafka (
 https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala)
 with createDirectStream and I get this error.


 java.lang.NoSuchMethodError:
 kafka.message.MessageAndMetadata.&lt;init&gt;(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
 at
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:176)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 I have checked the jars and it's the kafka-2.10-0.8.2.jar in the classpath
 with the MessageAndMetadata class. Even I navigated through eclipse and I
 came to this class with the right parameters.

 My pom.xml has these dependencies:

 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming_2.10</artifactId>
   <version>1.3.1</version>
   <!-- <scope>provided</scope> -->
   <exclusions>
     <exclusion>
       <groupId>org.codehaus.jackson</groupId>
       <artifactId>jackson-mapper-asl</artifactId>
     </exclusion>
     <exclusion>
       <groupId>org.codehaus.jackson</groupId>
       <artifactId>jackson-core-asl</artifactId>
     </exclusion>
   </exclusions>
 </dependency>

 <dependency>
   <groupId>org.apache.kafka</groupId>
   <artifactId>kafka_2.10</artifactId>
   <version>0.8.2.1</version>
 </dependency>

 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming-kafka_2.10</artifactId>
   <version>1.3.1</version>
   <!-- <scope>provided</scope> -->
   <exclusions>
     <exclusion>
       <groupId>org.codehaus.jackson</groupId>
       <artifactId>jackson-mapper-asl</artifactId>
     </exclusion>
     <exclusion>
       <groupId>org.codehaus.jackson</groupId>
       <artifactId>jackson-core-asl</artifactId>
     </exclusion>
   </exclusions>
 </dependency>
 </dependencies>

 Does somebody know where the error is? Thanks.



Re: JAVA for SPARK certification

2015-05-05 Thread Zoltán Zvara
I might join in to this conversation with an ask. Would someone point me to
a decent exercise that would approximate the level of this exam (from
above)? Thanks!

On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com
wrote:

 Production - not whole lot of companies have implemented Spark in
 production and so though it is good to have, not must.

 If you are on LinkedIn, a group of folks including myself are preparing
 for Spark certification, learning in group makes learning easy and fun.

 Kartik
 On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote:

 And how important is to have production environment?
 On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:

 There are questions in all three languages.

 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:

 I too have similar question.

 My understanding is since Spark written in scala, having done in Scala
 will be ok for certification.

 If someone who has done certification can confirm.

 Thanks,

 Kartik
 On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
 wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only
 Python and Scala not work?


 Regards,
 Gourav





RE: Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi Ankit,

printSchema() works fine for all the tables.

hiveStoreSalesDF.printSchema()
root
|-- store_sales.ss_sold_date_sk: integer (nullable = true)
|-- store_sales.ss_sold_time_sk: integer (nullable = true)
|-- store_sales.ss_item_sk: integer (nullable = true)
|-- store_sales.ss_customer_sk: integer (nullable = true)
|-- store_sales.ss_cdemo_sk: integer (nullable = true)
|-- store_sales.ss_hdemo_sk: integer (nullable = true)
|-- store_sales.ss_addr_sk: integer (nullable = true)
|-- store_sales.ss_store_sk: integer (nullable = true)
|-- store_sales.ss_promo_sk: integer (nullable = true)
|-- store_sales.ss_ticket_number: integer (nullable = true)
|-- store_sales.ss_quantity: integer (nullable = true)
|-- store_sales.ss_wholesale_cost: double (nullable = true)
|-- store_sales.ss_list_price: double (nullable = true)
|-- store_sales.ss_sales_price: double (nullable = true)
|-- store_sales.ss_ext_discount_amt: double (nullable = true)
|-- store_sales.ss_ext_sales_price: double (nullable = true)
|-- store_sales.ss_ext_wholesale_cost: double (nullable = true)
|-- store_sales.ss_ext_list_price: double (nullable = true)
|-- store_sales.ss_ext_tax: double (nullable = true)
|-- store_sales.ss_coupon_amt: double (nullable = true)
|-- store_sales.ss_net_paid: double (nullable = true)
|-- store_sales.ss_net_paid_inc_tax: double (nullable = true)
|-- store_sales.ss_net_profit: double (nullable = true)

dateDimDF.printSchema()
root
|-- d_date_sk: integer (nullable = false)
|-- d_date_id: string (nullable = false)
|-- d_date: date (nullable = true)
|-- d_month_seq: integer (nullable = true)
|-- d_week_seq: integer (nullable = true)
|-- d_quarter_seq: integer (nullable = true)
|-- d_year: integer (nullable = true)
|-- d_dow: integer (nullable = true)
|-- d_moy: integer (nullable = true)
|-- d_dom: integer (nullable = true)
|-- d_qoy: integer (nullable = true)
|-- d_fy_year: integer (nullable = true)
|-- d_fy_quarter_seq: integer (nullable = true)
|-- d_fy_week_seq: integer (nullable = true)
|-- d_day_name: string (nullable = true)
|-- d_quarter_name: string (nullable = true)
|-- d_holiday: string (nullable = true)
|-- d_weekend: string (nullable = true)
|-- d_following_holiday: string (nullable = true)
|-- d_first_dom: integer (nullable = true)
|-- d_last_dom: integer (nullable = true)
|-- d_same_day_ly: integer (nullable = true)
|-- d_same_day_lq: integer (nullable = true)
|-- d_current_day: string (nullable = true)
|-- d_current_week: string (nullable = true)
|-- d_current_month: string (nullable = true)
|-- d_current_quarter: string (nullable = true)
|-- d_current_year: string (nullable = true)

itemDF.printSchema()
root
|-- i_item_sk: integer (nullable = false)
|-- i_item_id: string (nullable = false)
|-- i_rec_start_date: date (nullable = true)
|-- i_rec_end_date: date (nullable = true)
|-- i_item_desc: string (nullable = true)
|-- i_current_price: decimal (nullable = true)
|-- i_wholesale_cost: decimal (nullable = true)
|-- i_brand_id: integer (nullable = true)
|-- i_brand: string (nullable = true)
|-- i_class_id: integer (nullable = true)
|-- i_class: string (nullable = true)
|-- i_category_id: integer (nullable = true)
|-- i_category: string (nullable = true)
|-- i_manufact_id: integer (nullable = true)
|-- i_manufact: string (nullable = true)
|-- i_size: string (nullable = true)
|-- i_formulation: string (nullable = true)
|-- i_color: string (nullable = true)
|-- i_units: string (nullable = true)
|-- i_container: string (nullable = true)
|-- i_manager_id: integer (nullable = true)
|-- i_product_name: string (nullable = true)

Regards,
Ishwardeep

From: ankitjindal [via Apache Spark User List] 
[mailto:ml-node+s1001560n22766...@n3.nabble.com]
Sent: Tuesday, May 5, 2015 5:00 PM
To: Ishwardeep Singh
Subject: RE: Unable to join table across data sources using sparkSQL

Just check the Schema of both the tables using frame.printSchema();










Re: Maximum Core Utilization

2015-05-05 Thread ayan guha
Also, if not already done, you may want to try repartitioning your data to 50
partitions.
On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote:

 Hi All,
 For a job I am running on Spark with a dataset of say 350,000 lines (not
 big), I am finding that even though my cluster has a large number of cores
 available (like 100 cores), the Spark system seems to stop after using just
 4 cores and after that the runtime is pretty much a straight line no matter
 how many more cores are thrown at it. I am wondering if Spark tries to
 figure out the maximum no. of cores to use based on the size of the
 dataset? If yes, is there a way to disable this feature and force it to use
 all the cores available?

 Thanks,
 Manu

 --

 The greater danger for most of us lies not in setting our aim too high and
 falling short; but in setting our aim too low, and achieving our mark.
 - Michelangelo
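A likely lever here is the number of partitions rather than a hard core limit: Spark
runs one task per partition, so a small input read as a handful of partitions keeps
only that many cores busy. A hedged sketch (assuming an existing SparkContext sc and
an illustrative input path):

// Ask for more input partitions up front...
val lines = sc.textFile("hdfs:///data/input.txt", 100)
println(lines.partitions.length)   // verify how many partitions (and hence tasks) you actually get

// ...or widen an existing RDD before the expensive stages (this triggers a shuffle).
val widened = lines.repartition(100)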



AvroFiles

2015-05-05 Thread Pankaj Deshpande
Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro
file was created using Avro 1.7.7. Similar to the example mentioned in
http://www.infoobjects.com/spark-with-avro/
I am getting a nullPointerException on Schema read. It could be an Avro
version mismatch. Has anybody had a similar issue with Avro?


Thanks


Re: what does Container exited with a non-zero exit code 10 means?

2015-05-05 Thread Marcelo Vanzin
What Spark tarball are you using? You may want to try the one for hadoop
2.6 (the one for hadoop 2.4 may cause that issue, IIRC).

On Tue, May 5, 2015 at 6:54 PM, felicia shsh...@tsmc.com wrote:

 Hi all,

 We're trying to implement SparkSQL on CDH5.3.0 with cluster mode,
 and we get this error either using java or python;


 Application application_1430482716098_0607 failed 2 times due to AM
 Container for appattempt_1430482716098_0607_02 exited with exitCode: 10
 due to: Exception from container-launch.
 Container id: container_e10_1430482716098_0607_02_01
 Exit code: 10
 Stack trace: ExitCodeException exitCode=10:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
 at org.apache.hadoop.util.Shell.run(Shell.java:455)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
 at

 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
 at

 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
 at

 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Container exited with a non-zero exit code 10
 .Failing this attempt.. Failing the application.


 where the detail log in nodes are:

 ERROR yarn.ApplicationMaster: Uncaught exception:
 java.lang.IllegalArgumentException: Invalid ContainerId:
 container_e10_1430482716098_0607_02_01
at

 org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
at

 org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
at

 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:83)
at

 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:576)
at

 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
at

 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at

 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at

 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:574)
at

 org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:597)
at
 org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
 Caused by: java.lang.NumberFormatException: For input string: e10
at

 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at

 org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
at

 org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
... 12 more



 we’ve already tried the solution described below but it doesn’t seem
 to work.
 Please advise if there’s any environmental settings we should exploit for
 clarifying our question, thanks!
 https://github.com/abhibasu/sparksql/wiki/SparkSQL-Configuration-in-CDH-5.3




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/what-does-Container-exited-with-a-non-zero-exit-code-10-means-tp22778.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Marcelo


Re: AvroFiles

2015-05-05 Thread Pankaj Deshpande
I am not using Kryo. I was using the regular sqlContext.avroFile to open it.
The file loads properly with the schema. The exception happens when I try to
read it. Will try the Kryo serializer and see if that helps.
On May 5, 2015 9:02 PM, Todd Nist tsind...@gmail.com wrote:

 Are you using Kryo or Java serialization? I found this post useful:


 http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist

 If using kryo, you need to register the classes with kryo, something like
 this:


 sc.registerKryoClasses(Array(
   classOf[ConfigurationProperty],
   classOf[Event]
 ))

 Or create a registrator something like this:

 class ODSKryoRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
     kryo.register(classOf[ConfigurationProperty], new AvroSerializer[ConfigurationProperty]())
     kryo.register(classOf[Event], new AvroSerializer[Event]())
   }
 }

 I encountered a similar error since several of the Avro core classes are
 not marked Serializable.

 HTH.

 Todd

 On Tue, May 5, 2015 at 7:09 PM, Pankaj Deshpande ppa...@gmail.com wrote:

 Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro
 file was created using Avro 1.7.7. Similar to the example mentioned in
 http://www.infoobjects.com/spark-with-avro/
 I am getting a nullPointerException on Schema read. It could be a avro
 version mismatch. Has anybody had a similar issue with avro.


 Thanks





Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Another thing - could it be a permission problem?
It creates all the directory structure /tmp/wordcount/
_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
so I am guessing not.

On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty njmu...@gmail.com wrote:

 You are most probably right. I assumed others may have run into this.
 When I try to put the files in there, it creates a directory structure
 with the part-0 and part1 files but these files are of size 0 - no
 content. The client error and the server logs have  the error message shown
 - which seem to indicate that the system is aware that a datanode exists
 but is excluded from the operation. So, it looks like it is not partitioned
 and Ambari indicates that HDFS is in good health with one NN, one SN, one
 DN.
 I am unable to figure out what the issue is.
 thanks for your help.

 On Tue, May 5, 2015 at 6:39 PM, ayan guha guha.a...@gmail.com wrote:

 What happens when you try to put files to your hdfs from local
 filesystem? Looks like its a hdfs issue rather than spark thing.
 On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:

 I have searched all replies to this question  not found an answer.

 I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

 Only line I added to the wordcount program is: (where 'counts' is the 
 JavaPairRDD)
 *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");*

 When I check in HDFS at that location (/tmp) here's what I find.
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
 and
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:  (0 + 2) 
 / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
  *could only be replicated to 0 nodes instead of minReplication (=1).  
 There are 1 datanode(s) running and 1 node(s) are excluded in this 
 operation.*
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


 I tried this with Spark 1.2.1 same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node  the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan


 --
 View this message in context: saveAsTextFile() to save output of Spark
 program to HDFS
 http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
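One frequently reported cause of the "could only be replicated to 0 nodes ... 1
node(s) are excluded" error when writing to a sandbox VM from the host machine is
that the NameNode hands back the DataNode's internal address, which the Spark client
cannot reach. That is only an assumption about this setup, not a confirmed diagnosis,
but if it applies, asking the HDFS client to connect by hostname (with
sandbox.hortonworks.com mapped in /etc/hosts) is a common workaround:

// Hedged sketch: make the HDFS client dial DataNodes by hostname instead of the VM-internal IP.
sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")
counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount")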





RE: Multilabel Classification in spark

2015-05-05 Thread Ulanov, Alexander
If you are interested in multilabel (not multiclass), you might want to take a 
look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is 
supposed to perform one-versus-all transformation on classes, which is usually 
how multilabel classifiers are built.

Alexander

From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Tuesday, May 05, 2015 3:44 PM
To: DB Tsai
Cc: peterg; user@spark.apache.org
Subject: Re: Multilabel Classification in spark

If you mean multilabel (predicting multiple label values), then MLlib does 
not yet support that.  You would need to predict each label separately.

If you mean multiclass (1 label taking >2 categorical values), then MLlib
supports it via LogisticRegression (as DB said), as well as DecisionTree and 
RandomForest.

Joseph

On Tue, May 5, 2015 at 1:27 PM, DB Tsai 
dbt...@dbtsai.commailto:dbt...@dbtsai.com wrote:
LogisticRegression in the MLlib package supports multilabel classification.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, May 5, 2015 at 1:13 PM, peterg 
pe...@garbers.memailto:pe...@garbers.me wrote:
 Hi all,

 I'm looking to implement a Multilabel classification algorithm but I am
 surprised to find that there are not any in the spark-mllib core library. Am
 I missing something? Would someone point me in the right direction?

 Thanks!

 Peter



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
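Until SPARK-7015 lands, a hand-rolled one-versus-rest scheme is the usual workaround
for multilabel problems on the 1.3-era MLlib API. A minimal sketch (the
RDD[(Set[Int], Vector)] input shape and the function name are illustrative, not an
existing API):

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// One binary classifier per label id; an example is positive for a label iff it carries that label.
def trainOneVsRest(data: RDD[(Set[Int], Vector)], numLabels: Int): Map[Int, LogisticRegressionModel] =
  (0 until numLabels).map { label =>
    val binary = data.map { case (labels, features) =>
      LabeledPoint(if (labels.contains(label)) 1.0 else 0.0, features)
    }
    label -> new LogisticRegressionWithLBFGS().run(binary)
  }.toMap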



Re: overloaded method constructor Strategy with alternatives

2015-05-05 Thread Ted Yu
Can you give us a bit more information ?
Such as release of Spark you're using, version of Scala, etc.

Thanks

On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote:

 I am getting on following code
 Error:(164, 25) *overloaded method constructor Strategy with alternatives:*
   (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity:
 org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses:
 Int,maxBins: Int,categoricalFeaturesInfo:

 java.util.Map[Integer,Integer])org.apache.spark.mllib.tree.configuration.Strategy
 and
   (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity:
 org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses:
 Int,maxBins: Int,quantileCalculationStrategy:

 org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo:
 scala.collection.immutable.Map[Int,Int],minInstancesPerNode:
 Int,minInfoGain: Double,maxMemoryInMB: Int,subsamplingRate:
 Double,useNodeIdCache: Boolean,checkpointInterval:
 Int)org.apache.spark.mllib.tree.configuration.Strategy
  cannot be applied to
 (org.apache.spark.mllib.tree.configuration.Algo.Value,
 org.apache.spark.mllib.tree.impurity.Gini.type, Int, Int, Int,
 scala.collection.immutable.Map[Int,Int])
 val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses,
 maxBins, categoricalFeaturesInfo)
 ^
 code
 val categoricalFeaturesInfo = Map[Int, Int]()
 val impurity = Gini
 val maxDepth = 4
 val maxBins = 32
 val algo = Algo.Classification

 val numClasses = 7

 val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses,
 maxBins, categoricalFeaturesInfo)
 /code



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/overloaded-method-constructor-Strategy-with-alternatives-tp22777.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
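The two alternatives listed in the error itself suggest two hedged fixes (Spark
1.3.0, Scala 2.10; untested sketches): either use the Scala constructor and pass the
QuantileStrategy argument explicitly before the Scala Map, or convert the map to the
java.util.Map[Integer, Integer] that the shorter constructor expects.

import scala.collection.JavaConverters._
import org.apache.spark.mllib.tree.configuration.{Algo, QuantileStrategy, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

val categoricalFeaturesInfo = Map[Int, Int]()
val algo = Algo.Classification

// Option 1: Scala constructor, with the QuantileStrategy argument supplied explicitly.
val strategy1 = new Strategy(algo, Gini, 4, 7, 32, QuantileStrategy.Sort, categoricalFeaturesInfo)

// Option 2: Java-friendly constructor, after converting the Scala Map[Int, Int].
val javaInfo: java.util.Map[Integer, Integer] =
  categoricalFeaturesInfo.map { case (k, v) => (Int.box(k), Int.box(v)) }.asJava
val strategy2 = new Strategy(algo, Gini, 4, 7, 32, javaInfo)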




Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread Rendy Bambang Junior
Hi all,

I am planning to load data from Kafka to HDFS. Is it normal to use Spark
Streaming to load data from Kafka to HDFS? What are the concerns with doing this?

There is no processing to be done by Spark, only storing data to HDFS
from Kafka for storage and for further Spark processing.

Rendy
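It is a common pattern (Camus, as suggested in the reply later in this digest, and
Flume are the other usual options). A minimal Spark Streaming sketch of the write
side, reusing the direct-stream setup from earlier in this digest (ssc, kafkaParams
and topics assumed); note that each batch interval produces a new directory of part
files, so short batch intervals mean many small HDFS files:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils.createDirectStream[String, String,
  StringDecoder, StringDecoder](ssc, kafkaParams, topics)

// Drop the keys and write each micro-batch under a timestamped directory prefix.
stream.map(_._2).saveAsTextFiles("hdfs:///data/kafka/events", "txt")

ssc.start()
ssc.awaitTermination()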


Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
You are most probably right. I assumed others may have run into this.
When I try to put the files in there, it creates a directory structure with
the part-0 and part1 files but these files are of size 0 - no
content. The client error and the server logs have  the error message shown
- which seem to indicate that the system is aware that a datanode exists
but is excluded from the operation. So, it looks like it is not partitioned
and Ambari indicates that HDFS is in good health with one NN, one SN, one
DN.
I am unable to figure out what the issue is.
thanks for your help.

On Tue, May 5, 2015 at 6:39 PM, ayan guha guha.a...@gmail.com wrote:

 What happens when you try to put files to your hdfs from local filesystem?
 Looks like its a hdfs issue rather than spark thing.
 On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:

 I have searched all replies to this question  not found an answer.

 I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

 Only line I added to the wordcount program is: (where 'counts' is the 
 JavaPairRDD)
 *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");*

 When I check in HDFS at that location (/tmp) here's what I find.
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
 and
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:  (0 + 2) 
 / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
  *could only be replicated to 0 nodes instead of minReplication (=1).  There 
 are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
  at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
  at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
  at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


 I tried this with Spark 1.2.1 same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node  the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan


 --
 View this message in context: saveAsTextFile() to save output of Spark
 program to HDFS
 http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.




Re: Join between Streaming data vs Historical Data in spark

2015-05-05 Thread Rendy Bambang Junior
Thanks.

Since the join will be done on a regular basis within a short period of time
(let's say 20s), do you have any suggestions on how to make it faster?

I am thinking of partitioning data set and cache it.

Rendy
On Apr 30, 2015 6:31 AM, Tathagata Das t...@databricks.com wrote:

 Have you taken a look at the join section in the streaming programming
 guide?


 http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins

 On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior 
 rendy.b.jun...@gmail.com wrote:

 Let say I have transaction data and visit data

 visit
 | userId | Visit source | Timestamp |
 | A  | google ads   | 1 |
 | A  | facebook ads | 2 |

 transaction
 | userId | total price | timestamp |
 | A  | 100 | 248384|
 | B  | 200 | 43298739  |

 I want to join transaction data and visit data to do sales attribution. I
 want to do it in real time whenever a transaction occurs (streaming).

 Is it scalable to do a join between one dataset and very big historical data
 using the join function in Spark? If it is not, then how is it usually done?

 Visit needs to be historical, since visit can be anytime before
 transaction (e.g. visit is one year before transaction occurs)

 Rendy
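One hedged way to keep the per-batch cost down is to hash-partition and cache the
historical visit data once, and join each micro-batch of transactions against it
inside transform(), so only the small batch is shuffled. A sketch with illustrative
parsing and field layout (not from a real code base):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

val partitioner = new HashPartitioner(100)

// Historical visits keyed by userId, partitioned and cached once, reused by every batch.
val visitsByUser = sc.textFile("hdfs:///history/visits")
  .map { line => val f = line.split("\t"); (f(0), f(1)) }   // (userId, visitSource)
  .partitionBy(partitioner)
  .cache()

// transactions: DStream[(userId, totalPrice)] arriving every ~20s.
def attribute(transactions: DStream[(String, Double)]): DStream[(String, (Double, String))] =
  transactions.transform { batch =>
    batch.partitionBy(partitioner).join(visitsByUser)
  }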





MLlib libsvm isssues with data

2015-05-05 Thread doyere
hi all:
I’ve met an issue with MLlib. I posted it to the community but it seems I put it in
the wrong place :(. Then I put it on Stack Overflow; for a well-formatted version
with details please see
http://stackoverflow.com/questions/30048344/spark-mllib-libsvm-isssues-with-data.
Hope someone could help.
I guess it’s due to my data, but I’ve tested it with the libsvm tools and it worked
well, and I’ve used the libsvm Python data-format check tool and it’s OK. I just
don’t know why it errors with java.lang.ArrayIndexOutOfBoundsException: -1 :(
And this is my first time using the mailing list to ask for help. If I did something
wrong or described it unclearly, please tell me.


doye

what does Container exited with a non-zero exit code 10 means?

2015-05-05 Thread felicia
Hi all,

We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, 
and we get this error either using java or python; 


Application application_1430482716098_0607 failed 2 times due to AM
Container for appattempt_1430482716098_0607_02 exited with exitCode: 10
due to: Exception from container-launch. 
Container id: container_e10_1430482716098_0607_02_01 
Exit code: 10 
Stack trace: ExitCodeException exitCode=10: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) 
at org.apache.hadoop.util.Shell.run(Shell.java:455) 
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) 
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
 
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
 
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Container exited with a non-zero exit code 10 
.Failing this attempt.. Failing the application. 


where the detail log in nodes are:

ERROR yarn.ApplicationMaster: Uncaught exception: 
java.lang.IllegalArgumentException: Invalid ContainerId:
container_e10_1430482716098_0607_02_01
   at
org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
   at
org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
   at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:83)
   at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:576)
   at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
   at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
   at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
   at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:574)
   at
org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:597)
   at
org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.lang.NumberFormatException: For input string: e10
   at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
   at java.lang.Long.parseLong(Long.java:589)
   at java.lang.Long.parseLong(Long.java:631)
   at
org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
   at
org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
   ... 12 more



we’ve already tried the solution described below but it doesn’t seem
to work. 
Please advise if there’s any environmental settings we should exploit for
clarifying our question, thanks!
https://github.com/abhibasu/sparksql/wiki/SparkSQL-Configuration-in-CDH-5.3




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/what-does-Container-exited-with-a-non-zero-exit-code-10-means-tp22778.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Two DataFrames with different schema, unionAll issue.

2015-05-05 Thread Michael Armbrust
You need to add a select clause to at least one dataframe to give them the
same schema before you can union them (much like in SQL).

On Tue, May 5, 2015 at 3:24 AM, Wilhelm niznik.pa...@gmail.com wrote:

 Hey there,

 1.) I'm loading 2 avro files that have slightly different schemas

 df1 = sqlc.load(file1, "com.databricks.spark.avro")
 df2 = sqlc.load(file2, "com.databricks.spark.avro")

 2.) I want to unionAll them

 nfd = dfs1.unionAll(dfs2)

 3.) Getting the following error

 ---
 Py4JJavaError Traceback (most recent call last)
 ipython-input-190-a86d9adbea83 in module()
  17
  18
 --- 19 nfd = dfs1.unionAll(dfs2)
  20
  21

 /home/hadoop/spark/python/pyspark/sql/dataframe.pyc in unionAll(self,
 other)
 669 This is equivalent to `UNION ALL` in SQL.
 670 
 -- 671 return DataFrame(self._jdf.unionAll(other._jdf),
 self.sql_ctx)
 672
 673 def intersect(self, other):

 /home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
 __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer,
 self.gateway_client,
 -- 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:

 /home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in
 get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(

 Py4JJavaError: An error occurred while calling o76196.unionAll.
 : org.apache.spark.sql.AnalysisException: unresolved operator 'Union ;
 at

 org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37)
 at

 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:97)
 at

 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88)
 at

 org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
 at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133)
 at
 org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157)
 at org.apache.spark.sql.DataFrame.unionAll(DataFrame.scala:641)
 at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
 at
 py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
 at py4j.Gateway.invoke(Gateway.java:259)
 at
 py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 at java.lang.Thread.run(Thread.java:745)
 ---

 4.) Is it possible to automatically merge 2 DFs with different schemas like
 that? Am I doing sth. wrong?

 Much appreciated!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Two-DataFrames-with-different-schema-unionAll-issue-tp22765.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
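A hedged sketch of Michael's suggestion, in Scala (the same select-before-union
approach works from PySpark): project both frames onto the columns they share, in
the same order, before unionAll.

// Columns present in both frames, in a fixed order.
val common = df1.columns.intersect(df2.columns)

val unioned = df1.select(common.head, common.tail: _*)
  .unionAll(df2.select(common.head, common.tail: _*))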




Re: overloaded method constructor Strategy with alternatives

2015-05-05 Thread Ash G
I am using Spark 1.3.0 and Scala 2.10.

Thanks

On Tue, May 5, 2015 at 6:48 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you give us a bit more information ?
 Such as release of Spark you're using, version of Scala, etc.

 Thanks

 On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote:

 I am getting on following code
 Error:(164, 25) *overloaded method constructor Strategy with
 alternatives:*
   (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity:
 org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses:
 Int,maxBins: Int,categoricalFeaturesInfo:

 java.util.Map[Integer,Integer])org.apache.spark.mllib.tree.configuration.Strategy
 and
   (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity:
 org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses:
 Int,maxBins: Int,quantileCalculationStrategy:

 org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo:
 scala.collection.immutable.Map[Int,Int],minInstancesPerNode:
 Int,minInfoGain: Double,maxMemoryInMB: Int,subsamplingRate:
 Double,useNodeIdCache: Boolean,checkpointInterval:
 Int)org.apache.spark.mllib.tree.configuration.Strategy
  cannot be applied to
 (org.apache.spark.mllib.tree.configuration.Algo.Value,
 org.apache.spark.mllib.tree.impurity.Gini.type, Int, Int, Int,
 scala.collection.immutable.Map[Int,Int])
 val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses,
 maxBins, categoricalFeaturesInfo)
 ^
 code
 val categoricalFeaturesInfo = Map[Int, Int]()
 val impurity = Gini
 val maxDepth = 4
 val maxBins = 32
 val algo = Algo.Classification

 val numClasses = 7

 val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses,
 maxBins, categoricalFeaturesInfo)
 /code



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/overloaded-method-constructor-Strategy-with-alternatives-tp22777.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running hive queries from HiveContext, for which we need a
hive-site.xml.

Is it possible to replace it with hive-config.xml? I tried but it does not
work. Just want a confirmation.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-use-hive-config-xml-instead-of-hive-site-xml-for-HiveContext-tp22776.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: AvroFiles

2015-05-05 Thread Todd Nist
Are you using Kryo or Java serialization? I found this post useful:

http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist

If using kryo, you need to register the classes with kryo, something like
this:


sc.registerKryoClasses(Array(
  classOf[ConfigurationProperty],
  classOf[Event]
))

Or create a registrator something like this:

class ODSKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[ConfigurationProperty], new AvroSerializer[ConfigurationProperty]())
    kryo.register(classOf[Event], new AvroSerializer[Event]())
  }
}

I encountered a similar error since several of the Avro core classes are
not marked Serializable.

HTH.

Todd

On Tue, May 5, 2015 at 7:09 PM, Pankaj Deshpande ppa...@gmail.com wrote:

 Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro
 file was created using Avro 1.7.7. Similar to the example mentioned in
 http://www.infoobjects.com/spark-with-avro/
 I am getting a nullPointerException on Schema read. It could be a avro
 version mismatch. Has anybody had a similar issue with avro.


 Thanks
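For reference, registerKryoClasses is a SparkConf method rather than a SparkContext
one; a minimal sketch of wiring it up (ConfigurationProperty and Event stand in for
your Avro-generated classes and must be on the classpath):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("avro-kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[ConfigurationProperty],
    classOf[Event]))

val sc = new SparkContext(conf)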



Re: Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread MrAsanjar .
why not try https://github.com/linkedin/camus - camus is kafka to HDFS
pipeline

On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior 
rendy.b.jun...@gmail.com wrote:

 Hi all,

 I am planning to load data from Kafka to HDFS. Is it normal to use spark
 streaming to load data from Kafka to HDFS? What are concerns on doing this?

 There are no processing to be done by Spark, only to store data to HDFS
 from Kafka for storage and for further Spark processing

 Rendy



Re: How to skip corrupted avro files

2015-05-05 Thread Imran Rashid
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593
and the discussion around the PRs.

This is probably more complicated than what you are looking for, but you
could copy the code for HadoopReliableRDD in the PR into your own code and
use it, without having to wait for the issue to get resolved.

On Sun, May 3, 2015 at 12:57 PM, Shing Hing Man mat...@yahoo.com.invalid
wrote:


 Hi,
  I am using Spark 1.3.1 to read a directory of about 2000 avro files.
 The avro files are from a third party and a few of them are corrupted.

   val path = {my directory of avro files}

  val sparkConf = new SparkConf().setAppName("avroDemo").setMaster("local")
   val sc = new SparkContext(sparkConf)

  val sqlContext = new SQLContext(sc)

  val data = sqlContext.avroFile(path)
  data.select(...)

  When I run the above code, I get the following exception.
  org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
  at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:222)
 ~[classes/:1.7.7]
  at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
 ~[avro-mapred-1.7.7-hadoop2.jar:1.7.7]
  at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
 ~[avro-mapred-1.7.7-hadoop2.jar:1.7.7]
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:245)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:212)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 ~[scala-library.jar:na]
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 ~[scala-library.jar:na]
  at
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
  at
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 ~[spark-core_2.10-1.3.1.jar:1.3.1]
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [na:1.7.0_71]
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [na:1.7.0_71]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
 Caused by: java.io.IOException: Invalid sync!
  at
 org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:314)
 ~[classes/:1.7.7]
  at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:209)
 ~[classes/:1.7.7]
  ... 25 common frames omitted

Is there an easy way to skip a corrupted avro file without reading the
 files one by one using sqlContext.avroFile(file) ?
  At present, my solution (hack) is to have my own version of
 org.apache.avro.file.DataFileStream whose hasNext method returns false
 (to signal the end of file) when
  java.io.IOException: Invalid sync!
   is thrown.
   Please see  line 210 in

 https://github.com/apache/avro/blob/branch-1.7/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java

   Thanks in advance for any assistance !
   Shing




Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric.

Q1:
When I read parquet files, I've tested that Spark generates so many
partitions as parquet files exist in the path.

Q2:
To reduce the number of partitions you can use rdd.repartition(x), where x is
the target number of partitions. Depending on your case, repartition can be a heavy (shuffle) task.
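
For example, a minimal sketch (the path and the target count of 200 are placeholders):

    val df = sqlContext.parquetFile("/path/to/parquet")
    // coalesce only reduces the partition count and avoids a full shuffle
    val narrowed = df.rdd.coalesce(200)
    // repartition always shuffles, but can also increase the partition count
    val reshuffled = df.rdd.repartition(200)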


Regards.
Miguel.

On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom 
eric.eijkelenb...@gmail.com wrote:

 Hello guys

 Q1: How does Spark determine the number of partitions when reading a
 Parquet file?

 val df = sqlContext.parquetFile(path)

 Is it some way related to the number of Parquet row groups in my input?

 Q2: How can I reduce this number of partitions? Doing this:

 df.rdd.coalesce(200).count

 from the spark-shell causes job execution to hang…

 Any ideas? Thank you in advance.

 Eric
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 


Saludos.
Miguel Ángel


Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
Very interested @Kartik/Zoltan. Please let me know how to connect on LI

On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 I might join in to this conversation with an ask. Would someone point me
 to a decent exercise that would approximate the level of this exam (from
 above)? Thanks!

 On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com
 wrote:

 Production - not whole lot of companies have implemented Spark in
 production and so though it is good to have, not must.

 If you are on LinkedIn, a group of folks including myself are preparing
 for Spark certification, learning in group makes learning easy and fun.

 Kartik
 On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote:

 And how important is to have production environment?
 On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:

 There are questions in all three languages.

 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:

 I too have similar question.

 My understanding is since Spark written in scala, having done in Scala
 will be ok for certification.

 If someone who has done certification can confirm.

 Thanks,

 Kartik
 On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
 wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only
 Python and Scala not work?


 Regards,
 Gourav





-- 
Best Regards,
Ayan Guha


Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-05 Thread Imran Rashid
Gerard is totally correct -- to expand a little more, I think what you want
to do is a solrInputDocumentJavaRDD.foreachPartition, instead of
solrInputDocumentJavaRDD.foreach:


solrInputDocumentJavaRDD.foreachPartition(
  new VoidFunction<Iterator<SolrInputDocument>>() {
    @Override
    public void call(Iterator<SolrInputDocument> docItr) {
      List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
      // Add each SolrInputDocument in this partition to the list
      // (an Iterator can't be used directly in an enhanced for loop)
      while (docItr.hasNext()) {
        docs.add(docItr.next());
      }
      // push things to solr **from the executor, for this partition**
      // so for this to make sense, you need to be sure solr can handle a bunch
      // of executors pushing into it simultaneously.
      addThingsToSolr(docs);
    }
});

On Mon, May 4, 2015 at 8:44 AM, Gerard Maas gerard.m...@gmail.com wrote:

 I'm not familiar with the Solr API but provided that ' SolrIndexerDriver'
 is a singleton, I guess that what's going on when running on a cluster is
 that the call to:

  SolrIndexerDriver.solrInputDocumentList.add(elem)

 is happening on different singleton instances of the  SolrIndexerDriver
 on different JVMs while

 SolrIndexerDriver.solrServer.commit

 is happening on the driver.

 In practical terms, the lists on the executors are being filled-in but
 they are never committed and on the driver the opposite is happening.

 -kr, Gerard

 On Mon, May 4, 2015 at 3:34 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 I'm trying to deal with some code that runs differently on Spark
 stand-alone mode and Spark running on a cluster. Basically, for each item
 in an RDD, I'm trying to add it to a list, and once this is done, I want to
 send this list to Solr.

 This works perfectly fine when I run the following code in stand-alone
 mode of Spark, but does not work when the same code is run on a cluster.
 When I run the same code on a cluster, it is like send to Solr part of
 the code is executed before the list to be sent to Solr is filled with
 items. I try to force the execution by solrInputDocumentJavaRDD.collect();
 after foreach, but it seems like it does not have any effect.

 // For each RDD
 solrInputDocumentJavaDStream.foreachRDD(
 new Function<JavaRDD<SolrInputDocument>, Void>() {
   @Override
   public Void call(JavaRDD<SolrInputDocument>
 solrInputDocumentJavaRDD) throws Exception {

 // For each item in a single RDD
 solrInputDocumentJavaRDD.foreach(
 new VoidFunction<SolrInputDocument>() {
   @Override
   public void call(SolrInputDocument
 solrInputDocument) {

 // Add the solrInputDocument to the list of
 SolrInputDocuments

 SolrIndexerDriver.solrInputDocumentList.add(solrInputDocument);
   }
 });

 // Try to force execution
 solrInputDocumentJavaRDD.collect();


 // After having finished adding every SolrInputDocument to
 the list
 // add it to the solrServer, and commit, waiting for the
 commit to be flushed
 try {

   // Seems like when run in cluster mode, the list size is
 zero,
  // therefore the following part is never executed

   if (SolrIndexerDriver.solrInputDocumentList != null
       && SolrIndexerDriver.solrInputDocumentList.size() > 0) {

 SolrIndexerDriver.solrServer.add(SolrIndexerDriver.solrInputDocumentList);
 SolrIndexerDriver.solrServer.commit(true, true);
 SolrIndexerDriver.solrInputDocumentList.clear();
   }
 } catch (SolrServerException | IOException e) {
   e.printStackTrace();
 }


 return null;
   }
 }
 );


 What should I do, so that sending-to-Solr part executes after the list of
 SolrDocuments are added to solrInputDocumentList (and works also in cluster
 mode)?


 --
 Emre Sevinç





Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server
so far and using JDBC prepared statements to escape potentially malicious
user input.

I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to generate a
properly escaped sql statement...

Wondering if someone has ideas on proper way to do this?

To be concrete, I would love to issue this statement

 val df = myHiveCtxt.sql(sqlText)

but I would like to defend against potential SQL injection.


Possible to disable Spark HTTP server ?

2015-05-05 Thread roy
Hi,

  When we start a spark job it starts a new HTTP server for each new job.
Is it possible to disable the HTTP server for each job?


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark job concurrency problem

2015-05-05 Thread Imran Rashid
can you give your entire spark submit command?  Are you missing
--executor-cores num_cpu?  Also, if you intend to use all 6 nodes, you
also need --num-executors 6
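
For reference, a sketch of the kind of command I mean (the class, jar and
resource numbers are placeholders to adjust for your cluster):

    spark-submit --master yarn-cluster \
      --num-executors 6 --executor-cores 4 --executor-memory 4g \
      --class com.example.MyApp my-app.jar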

On Mon, May 4, 2015 at 2:07 AM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have two small RDD, each has about 600 records. In my code, I did

 val rdd1 = sc...cache()
 val rdd2 = sc...cache()

 val result = rdd1.cartesian(rdd2).*repartition*(num_cpu).map { case (a, b) =>
   some_expensive_job(a,b)
 }

 I ran my job in YARN cluster with --master yarn-cluster, I have 6
 executor, and each has a large memory volume.

 However, I noticed my job is very slow. I went to the RM page, and found
 there are two containers, one is the driver, one is the worker. I guess
 this is correct?

 I went to the worker's log, and monitor the log detail. My app print some
 information, so I can use them to estimate the progress of the map
 operation. Looking at the log, it feels like the jobs are done one by one
 sequentially, rather than #cpu batch at a time.

 I checked the worker node, and their CPU are all busy.



 [image: --]
 Xi Shen
 [image: http://]about.me/davidshen
 http://about.me/davidshen?promo=email_sig
   http://about.me/davidshen



Where does Spark persist RDDs on disk?

2015-05-05 Thread Haoliang Quan
Hi,

I'm using persist on different storage levels, but I found no difference on
performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might
be something wrong with my code... So where can I find the persisted RDDs
on disk so that I can make sure they were persisted indeed?

Thank you.


Where does Spark persist RDDs on disk?

2015-05-05 Thread hquan
Hi, 

I'm using persist on different storage levels, but I found no difference on
performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might
be something wrong with my code... So where can I find the persisted RDDs on
disk so that I can make sure they were persisted indeed? 

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Where-does-Spark-persist-RDDs-on-disk-tp22771.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread Imran Rashid
Are you setting a really large max buffer size for kryo?
Was this fixed by https://issues.apache.org/jira/browse/SPARK-6405 ?


If not, we should open up another issue to get a better warning in these
cases.

On Tue, May 5, 2015 at 2:47 AM, shahab shahab.mok...@gmail.com wrote:

 Thanks Tristan for sharing this. Actually this happens when I am reading a
 csv file of 3.5 GB.

 best,
 /Shahab



 On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org
 wrote:

 Hi Shahab,

 I’ve seen exceptions very similar to this (it also manifests as negative
 array size exception), and I believe it’s really a bug in Kryo.

 See this thread:

 http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E

 Manifests in all of the following situations when working with an object
 graph in excess of a few GB: Joins, Broadcasts, and when using the hadoop
 save APIs.

 Tristan


 On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote:

 Can you post your code, otherwise there's not much we can do.

 Regards,

 Olivier.

 On Sat, May 2, 2015 at 21:15, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am using sprak-1.2.0 and I used Kryo serialization but I get the
 following excepton.

 java.io.IOException: com.esotericsoftware.kryo.KryoException:
 java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

 I do apprecciate if anyone could tell me how I can resolve this?

 best,
 /Shahab






Re: How to separate messages of different topics.

2015-05-05 Thread Cody Koeninger
Make sure to read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

The directStream / KafkaRDD has a 1 : 1 relationship between kafka
topic/partition and spark partition.  So a given spark partition only has
messages from 1 kafka topic.  You can tell what topic that is using
HasOffsetRanges, as discussed in the post.

This 1 : 1 relationship only holds until the first transformation that
incurs a shuffle.
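
A minimal sketch of that pattern, reusing the directKafkaStream from your
example (the cast must happen on the KafkaRDD itself, before any shuffle):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    val tagged = directKafkaStream.transform { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        // spark partition i maps to exactly one kafka topic/partition
        val topic = offsetRanges(i).topic
        iter.map { case (key, value) => (topic, value) }
      }
    }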

On Tue, May 5, 2015 at 8:29 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:

 I want to read from many topics in Kafka and know from where each message
 is coming (topic1, topic2 and so on).

  val kafkaParams = Map[String, String]("metadata.broker.list" ->
 "myKafka:9092")
  val topics = Set("EntryLog", "presOpManager")
  val directKafkaStream = KafkaUtils.createDirectStream[String, String,
 StringDecoder, StringDecoder](ssc, kafkaParams, topics)

  Is there some way to separate the messages for topics with just one
 directStream? or should I create different streamings for each topic?



Parquet number of partitions

2015-05-05 Thread Eric Eijkelenboom
Hello guys

Q1: How does Spark determine the number of partitions when reading a Parquet 
file?

val df = sqlContext.parquetFile(path)

Is it some way related to the number of Parquet row groups in my input?

Q2: How can I reduce this number of partitions? Doing this:

df.rdd.coalesce(200).count

from the spark-shell causes job execution to hang… 

Any ideas? Thank you in advance. 

Eric
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi,

I think all the required materials for reference are mentioned here:
http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification

My question was regarding the proficiency level required for Java. There
are detailed examples and code mentioned for JAVA, Python and Scala in most
of the SCALA tutorials mentioned in the above link for reference.


Regards,
Gourav

On Tue, May 5, 2015 at 3:03 PM, ayan guha guha.a...@gmail.com wrote:

 Very interested @Kartik/Zoltan. Please let me know how to connect on LI

 On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com
 wrote:

 I might join in to this conversation with an ask. Would someone point me
 to a decent exercise that would approximate the level of this exam (from
 above)? Thanks!

 On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com
 wrote:

 Production - not whole lot of companies have implemented Spark in
 production and so though it is good to have, not must.

 If you are on LinkedIn, a group of folks including myself are preparing
 for Spark certification, learning in group makes learning easy and fun.

 Kartik
 On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote:

 And how important is to have production environment?
 On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:

 There are questions in all three languages.

 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:

 I too have similar question.

 My understanding is since Spark written in scala, having done in
 Scala will be ok for certification.

 If someone who has done certification can confirm.

 Thanks,

 Kartik
 On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
 wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only
 Python and Scala not work?


 Regards,
 Gourav





 --
 Best Regards,
 Ayan Guha



[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all,

We will drop support for Java 6 starting Spark 1.5, tentative scheduled to
be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015,
will be the last minor release that supports Java 6. That is to say:

Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8.

Spark 1.5+ (~ Sep 2015): will NOT work with Java 6, but work with Java 7, 8.


PS: Oracle ended Java 6 updates in Feb 2013.


Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files to your hdfs from local filesystem?
Looks like its a hdfs issue rather than spark thing.
On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:


 I have searched all replies to this question  not found an answer.

 I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by 
 side, on the same machine and trying to write output of wordcount program 
 into HDFS (works fine writing to a local file, /tmp/wordcount).

 Only line I added to the wordcount program is: (where 'counts' is the 
 JavaPairRDD)
 *counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");*

 When I check in HDFS at that location (/tmp) here's what I find.
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
 and
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1

 and *both part-000[01] are 0 size files*.

 The wordcount client output error is:
 [Stage 1:  (0 + 2) / 
 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
 /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
  *could only be replicated to 0 nodes instead of minReplication (=1).  There 
 are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)


 I tried this with Spark 1.2.1 same error.
 I have plenty of space on the DFS.
 The Name Node, Sec Name Node  the one Data Node are all healthy.

 Any hint as to what may be the problem ?
 thanks in advance.
 Sudarshan


 --
 View this message in context: saveAsTextFile() to save output of Spark
 program to HDFS
 http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
If you mean multilabel (predicting multiple label values), then MLlib
does not yet support that.  You would need to predict each label separately.

If you mean multiclass (1 label taking >2 categorical values), then MLlib
supports it via LogisticRegression (as DB said), as well as DecisionTree
and RandomForest.
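
For the multiclass case, a minimal sketch with the RDD-based API (the toy
data is just for illustration; labels must be 0, 1, ..., k-1):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(2.0, Vectors.dense(1.0, 1.0))))

    // 3-class logistic regression; DecisionTree/RandomForest take a
    // numClasses parameter in a similar way
    val model = new LogisticRegressionWithLBFGS().setNumClasses(3).run(training)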

Joseph

On Tue, May 5, 2015 at 1:27 PM, DB Tsai dbt...@dbtsai.com wrote:

 LogisticRegression in MLlib package supports multilable classification.

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote:
  Hi all,
 
  I'm looking to implement a Multilabel classification algorithm but I am
  surprised to find that there are not any in the spark-mllib core
 library. Am
  I missing something? Would someone point me in the right direction?
 
  Thanks!
 
  Peter
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark SQL Standalone mode missing parquet?

2015-05-05 Thread Manu Mukerji
Hi All,

When I try to run Spark SQL in standalone mode it appears to be missing
the parquet jar; I have to pass it with --jars and that works:

sbin/start-thriftserver.sh --jars lib/parquet-hive-bundle-1.6.0.jar
--driver-memory 28g --master local[10]

Any ideas on why? I downloaded the one pre built for Hadoop 2.6 and later

thanks,
Manu


Re: Possible to disable Spark HTTP server ?

2015-05-05 Thread Ted Yu
SPARK-3490 introduced spark.ui.enabled

FYI
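
A minimal sketch of using it (note this disables the application web UI
entirely, so you lose those job pages too):

    import org.apache.spark.SparkConf
    val conf = new SparkConf().set("spark.ui.enabled", "false")

or equivalently pass --conf spark.ui.enabled=false to spark-submit.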

On Tue, May 5, 2015 at 8:41 AM, roy rp...@njit.edu wrote:

 Hi,

   When we start spark job it start new HTTP server for each new job.
 Is it possible to disable HTTP server for each job ?


 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to skip corrupted avro files

2015-05-05 Thread Shing Hing Man
Thanks for the info !
Shing
 


 On Tuesday, 5 May 2015, 15:11, Imran Rashid iras...@cloudera.com wrote:
   

 You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 
and the discussion around the PRs.
This is probably more complicated than what you are looking for, but you could 
copy the code for HadoopReliableRDD in the PR into your own code and use it, 
without having to wait for the issue to get resolved.
On Sun, May 3, 2015 at 12:57 PM, Shing Hing Man mat...@yahoo.com.invalid 
wrote:


Hi, I am using Spark 1.3.1 to read a directory of about 2000 avro files. The 
avro files are from a third party and a few of them are corrupted.
  val path = {myDirecotry of avro files}
 val sparkConf = new SparkConf().setAppName(avroDemo).setMaster(local)  val 
sc = new SparkContext(sparkConf)
 val sqlContext = new SQLContext(sc)
 val data = sqlContext.avroFile(path); data.select(.)
 When I run the above code, I get the following exception.  
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync! at 
org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:222) 
~[classes/:1.7.7] at 
org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64) 
~[avro-mapred-1.7.7-hadoop2.jar:1.7.7] at 
org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32) 
~[avro-mapred-1.7.7-hadoop2.jar:1.7.7] at 
org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:245) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:212) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
~[scala-library.jar:na] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
~[scala-library.jar:na] at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
 ~[spark-sql_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:244) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.scheduler.Task.run(Task.scala:64) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) 
~[spark-core_2.10-1.3.1.jar:1.3.1] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
[na:1.7.0_71] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
[na:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]Caused by: 
java.io.IOException: Invalid sync! at 
org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:314) 
~[classes/:1.7.7] at 
org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:209) 
~[classes/:1.7.7] ... 25 common frames omitted
   Is there an easy way to skip a corrupted avro file without reading the files 
one by one using sqlContext.avroFile(file) ? At present, my solution (hack)  is 
to have my own version of org.apache.avro.file.DataFileStream with method 
hasNext returns false (to signal the end file), when  java.io.IOException: 
Invalid sync!   is thrown.  Please see  line 210 in  
https://github.com/apache/avro/blob/branch-1.7/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java
  Thanks in advance for any assistance !  Shing 




  

Inserting Nulls

2015-05-05 Thread Masf
Hi.

I have a spark application where I store the results into table (with
HiveContext). Some of these columns allow nulls. In Scala, this columns are
represented through Option[Int] or Option[Double].. Depend on the data type.

For example:

val hc = new HiveContext(sc)
var col1: Option[Integer] = None
...

val myRow = org.apache.spark.sql.Row(col1, ...)

val mySchema = StructType(Array(StructField("Column1", IntegerType,
true)))

val TableOutputSchemaRDD = hc.applySchema(myRow, mySchema)
hc.registerRDDAsTable(TableOutputSchemaRDD, "result_intermediate")
hc.sql("CREATE TABLE table_output STORED AS PARQUET AS SELECT * FROM
result_intermediate")

Produce:

java.lang.ClassCastException: scala.Some cannot be cast to java.lang.Integer
at
org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector.get(JavaIntObjectInspector.java:40)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:247)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




Thanks!
--

Regards.
Miguel Ángel


Spark applications Web UI at 4040 doesn't exist

2015-05-05 Thread marco.doncel
Hi all,

I'm not able to access the running Spark Streaming applications that I'm
submitting to the EC2 standalone cluster (spark 1.3.1) via port 4040. The
problem is that I don't even see running applications in the master's web UI
(I do see running drivers). This is the command I use to submit the app:

bin/spark-submit --class my.domain.mainClass --deploy-mode cluster --conf
spark.executor.logs.rolling.size.maxBytes=40
spark.executor.logs.rolling.strategy=size
spark.executor.logs.rolling.maxRetainedFiles=4
/home/ec2-user/application.jar

The url I'm trying to access to see that web UI is the driver EC2 public
address with port 4040. 

I also tried the solution provided in  this post
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Web-UI-is-not-showing-Running-Completed-Active-Applications-td18475.html
  
with no luck. 

Thanks in advance mates!!
Marco





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-applications-Web-UI-at-4040-doesn-t-exist-tp22773.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Number of files to load

2015-05-05 Thread Rendy Bambang Junior
Let say I am storing my data in HDFS with folder structure and file
partitioning as per below:
/analytics/2015/05/02/partition-2015-05-02-13-50-
Note that new file is created every 5 minutes.

As per my understanding, storing 5-minute files means we cannot create an
RDD more granular than 5 minutes.

In the other hand, when we want to aggregate monthly data, number of file
will be enormous (around 84000 files).

My question is, what are the consideration to say that the number of file
to be loaded to RDD is just 'too many'? Is 84000 'too many' files?

One thing that comes to my mind is overhead when spark try to open file,
however Im not sure whether it is a valid concern.

Rendy


Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
As per my understanding, storing 5-minute files means we cannot create an
RDD more granular than 5 minutes.

This depends on the file format. Many file formats are splittable (like
parquet), meaning that you can seek into various points of the file.

2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior rendy.b.jun...@gmail.com:

 Let say I am storing my data in HDFS with folder structure and file
 partitioning as per below:
 /analytics/2015/05/02/partition-2015-05-02-13-50-
 Note that new file is created every 5 minutes.

 As per my understanding, storing 5-minute files means we cannot create an
 RDD more granular than 5 minutes.

 In the other hand, when we want to aggregate monthly data, number of file
 will be enormous (around 84000 files).

 My question is, what are the consideration to say that the number of file
 to be loaded to RDD is just 'too many'? Is 84000 'too many' files?

 One thing that comes to my mind is overhead when spark try to open file,
 however Im not sure whether it is a valid concern.

 Rendy



saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan
I have searched all replies to this question & not found an answer.

I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by side, on
the same machine and trying to write output of the wordcount program into HDFS
(works fine writing to a local file, /tmp/wordcount).

Only line I added to the wordcount program is: (where 'counts' is the JavaPairRDD)
*counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount");*

When I check in HDFS at that location (/tmp) here's what I find.
/tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0
and
/tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
and *both part-000[01] are 0 size files*.

The wordcount client output error is:
[Stage 1:  (0 + 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
*could only be replicated to 0 nodes instead of minReplication (=1).  There
are 1 datanode(s) running and 1 node(s) are excluded in this operation.*
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642)

I tried this with Spark 1.2.1, same error.
I have plenty of space on the DFS.
The Name Node, Sec Name Node & the one Data Node are all healthy.

Any hint as to what may be the problem? Thanks in advance.
Sudarshan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Inserting Nulls

2015-05-05 Thread Michael Armbrust
Option only works when you are going from case classes.  Just put null into
the Row, when you want the value to be null.
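
A minimal sketch of that, reusing the Option fields from your example
(boxing so that None becomes a plain null in the Row):

    import org.apache.spark.sql.Row

    val col1: Option[Int] = None
    val col2: Option[Double] = Some(1.5)
    val myRow = Row(col1.map(Int.box).orNull, col2.map(Double.box).orNull)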

On Tue, May 5, 2015 at 9:00 AM, Masf masfwo...@gmail.com wrote:

 Hi.

 I have a spark application where I store the results into table (with
 HiveContext). Some of these columns allow nulls. In Scala, this columns are
 represented through Option[Int] or Option[Double].. Depend on the data type.

 For example:

 val hc = new HiveContext(sc)
 var col1: Option[Integer] = None
 ...

 val myRow = org.apache.spark.sql.Row(col1, ...)

 val mySchema = StructType(Array(StructField("Column1", IntegerType,
 true)))

 val TableOutputSchemaRDD = hc.applySchema(myRow, mySchema)
 hc.registerRDDAsTable(TableOutputSchemaRDD, "result_intermediate")
 hc.sql("CREATE TABLE table_output STORED AS PARQUET AS SELECT * FROM
 result_intermediate")

 Produce:

 java.lang.ClassCastException: scala.Some cannot be cast to
 java.lang.Integer
 at
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector.get(JavaIntObjectInspector.java:40)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:247)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
 $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)




 Thanks!
 --

 Regards.
 Miguel Ángel



Re: spark sql, creating literal columns in java.

2015-05-05 Thread Michael Armbrust
This should work from java too:
http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$
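
For instance, a minimal Scala sketch (assuming df is an existing DataFrame;
in Java the same static method is available on org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.lit
    // adds a constant column named "one" with the literal value 1
    val withConstant = df.withColumn("one", lit(1))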

On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann janpaulbultm...@me.com
wrote:

 Hey,
 What is the recommended way to create literal columns in java?
 Scala has the `lit` function from  `org.apache.spark.sql.functions`.
 Should it be called from java as well?

 Cheers jan

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Parquet Partition Strategy - how to partition data correctly

2015-05-05 Thread Todd Nist
Hi,

I have a DataFrame that represents my data looks like this:

+-++
| col_name| data_type  |
+-++
| obj_id  | string |
| type| string |
| name| string |
| metric_name | string |
| value   | double |
| ts  | timestamp  |
+-++

It is working fine, and I can store it to parquet with:

df.saveAsParquetFile(/user/data/metrics)

I would like to leverage parquet partitioning as referenced here,
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

I would like to see a representation something like this:

usr
|__ data
  |__ metrics
|__ type=Virtual Machine
  |__ objId=1234
|__ metricName=CPU Demand
  |__ mmdd
|__ data.parquet
|__ metricName=CPU Utilization
  |__ mmdd
|__ data.parquet
  |__ objId=5678
|__ metricName=CPU Demand
  |__ mmdd
|__ data.parquet
|__ type=Application
  |__ objId=0009
|__ metricName=Response Time
  |__ mmdd
|__ data.parquet
|__ metricName=Slow Response
  |__ mmdd
|__ data.parquet
  |__ objId=0303
|__ metricName=Response Time
  |__ mmdd
|__ data.parquet


What is the correct way to achieve this? I can do something like:

df.map { case Row(nodeType: String, objId: String, name: String,
metricName: String, value: Double, ts: java.sql.Timestamp) =>

  ...
   // construct path
   val path =
s"/usr/data/metrics/type=${nodeType}/objId=${objId}/metricName=${metricName}/${floorToDay(ts)}"
   // save record as parquet
   df.saveAsParquet(path, Row)

  ...}
Is this the right approach or is there a more optimal approach?  This would save
every row as an individual file.  I will receive multiple entries for a
given metric, type and objId combination in a given day.

TIA for the assistance.

-Todd


Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread jeanlyn
Have you configured SPARK_CLASSPATH with the mysql jar in spark-env.sh? For
example (export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar)

 On May 5, 2015, at 3:32, 鹰 980548...@qq.com wrote:
 
 my metastore is like this
  <property>
 <name>javax.jdo.option.ConnectionURL</name>
 <value>jdbc:mysql://192.168.1.40:3306/hive</value>
 </property>

 <property>
 <name>javax.jdo.option.ConnectionDriverName</name>
 <value>com.mysql.jdbc.Driver</value>
 <description>Driver class name for a JDBC metastore</description>
 </property>

 <property>
 <name>javax.jdo.option.ConnectionUserName</name>
 <value>hive</value>
 </property>

 <property>
 <name>javax.jdo.option.ConnectionPassword</name>
 <value>hive123!@#</value>
 </property>

 <property>
 <name>hive.metastore.warehouse.dir</name>
 <value>/user/${user.name}/hive-warehouse</value>
 <description>location of default database for the
 warehouse</description>
 </property>
 
 
 
 ------ Original message ------
 From: Wang, Daoyuan daoyuan.w...@intel.com
 Sent: Tuesday, May 5, 2015, 3:22 PM
 To: 鹰 980548...@qq.com; luohui20001 luohui20...@sina.com
 Cc: user user@spark.apache.org
 Subject: RE: Reply: spark Unable to instantiate
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
 How did you configure your metastore?
 
  
 
 Thanks,
 
 Daoyuan
 
 From: 鹰 [mailto:980548...@qq.com]
 Sent: Tuesday, May 05, 2015 3:11 PM
 To: luohui20001
 Cc: user
 Subject: Reply: spark Unable to instantiate
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
  
 
 hi luo,
 thanks for your reply in fact I can use hive by spark on my  spark master 
 machine, but when I copy my spark files to another machine  and when I want 
 to access the hive by spark get the error Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient , I have copy 
 hive-site.xml to spark conf directory and I have the authenticated to access 
 hive metastore warehouse;
 
  
 
 Thanks , Best regards!
 
 ------ Original message ------

 From: luohui20001 luohui20...@sina.com

 Sent: Tuesday, May 5, 2015, 9:56 AM

 To: 鹰 980548...@qq.com;
 user user@spark.apache.org

 Subject: Reply: spark Unable to instantiate
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
  
 
 you may need to copy hive-site.xml to your spark conf directory and check 
 your hive metastore warehouse setting, and also check if you are 
 authenticated to access hive metastore warehouse.
 
 
 
 
  
 
 Thanks & Best regards!
  罗辉 San.Luo
 
  
 
 - Original message -
 From: 鹰 980548...@qq.com
 To: user user@spark.apache.org
 Subject: spark Unable to instantiate
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 Date: May 5, 2015, 08:49
 
 
 hi all,
when I submit a spark-sql program to select data from my hive
 database I get an error like this:
 User class threw exception: java.lang.RuntimeException: Unable to instantiate
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my
 spark configuration? Thanks for any help!
 



Re: Re: sparksql running slow while joining 2 tables.

2015-05-05 Thread Olivier Girardot
Can you activate your eventLogs and send them to us?
Thank you !
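
For reference, the settings involved are roughly the following (the log
directory is a placeholder), set in spark-defaults.conf or via --conf
before re-running the job:

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs:///tmp/spark-events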

On Tue, May 5, 2015 at 04:56, luohui20001 luohui20...@sina.com wrote:

 Yes,just by default 1 executor.thanks



 Sent from my Xiaomi phone
 On May 4, 2015 at 10:01 PM, ayan guha guha.a...@gmail.com wrote:

 Are you using only 1 executor?

 On Mon, May 4, 2015 at 11:07 PM, luohui20...@sina.com wrote:

 hi Olivier

 spark1.3.1, with java1.8.0.45

 and add 2 pics .

 it seems like a GC issue. I also tried with different parameters like
 memory size of driverexecutor, memory fraction, java opts...

 but this issue still happens.


 

 Thanks & Best regards!
 罗辉 San.Luo

 - Original message -
 From: Olivier Girardot ssab...@gmail.com
 To: luohui20...@sina.com, user user@spark.apache.org
 Subject: Re: sparksql running slow while joining 2 tables.
 Date: May 4, 2015, 20:46

 Hi,
 What is you Spark version ?

 Regards,

 Olivier.

 On Mon, May 4, 2015 at 11:03, luohui20...@sina.com wrote:

 hi guys

 when I am running a SQL query like select a.name, a.startpoint, a.endpoint,
 a.piece from db a join sample b on (a.name = b.name) where (b.startpoint
 > a.startpoint + 25); I find sparksql running slow (minutes), which may be
 caused by very long GC and shuffle time.


table db is created from a txt file of about 56mb, while table
 sample is about 26mb, both small.

my spark cluster is a standalone pseudo-distributed spark
 cluster with an 8g executor and a 4g driver.

any advice? thank you guys.



 

 Thanks & Best regards!
 罗辉 San.Luo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Best Regards,
 Ayan Guha




Re: Is LIMIT n in Spark SQL useful?

2015-05-05 Thread Robin East
Michael

Are there plans to add LIMIT push down? It's quite a natural thing to do in 
interactive querying.

Sent from my iPhone

 On 4 May 2015, at 22:57, Michael Armbrust mich...@databricks.com wrote:
 
 The JDBC interface for Spark SQL does not support pushing down limits today.
 
 On Mon, May 4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote:
 and a further question - have you tried running this query in pqsl? what’s 
 the performance like there?
 
 On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote:
 
 What query are you running. It may be the case that your query requires 
 PostgreSQL to do a large amount of work before identifying the first n rows
 On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote:
 
 I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and 
 improve query performance, but I found it took as long as
 querying without LIMIT. That confused me. Anybody know why?
 
 Thanks.
 
 Regards,
 Yi
 


Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread shahab
Thanks Tristan for sharing this. Actually this happens when I am reading a
csv file of 3.5 GB.

best,
/Shahab



On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org
wrote:

 Hi Shahab,

 I’ve seen exceptions very similar to this (it also manifests as negative
  array size exception), and I believe it’s really a bug in Kryo.

 See this thread:

 http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E

 Manifests in all of the following situations when working with an object
 graph in excess of a few GB: Joins, Broadcasts, and when using the hadoop
 save APIs.

 Tristan


 On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote:

 Can you post your code, otherwise there's not much we can do.

 Regards,

 Olivier.

  On Sat, May 2, 2015 at 21:15, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am using sprak-1.2.0 and I used Kryo serialization but I get the
 following excepton.

 java.io.IOException: com.esotericsoftware.kryo.KryoException:
 java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

 I do apprecciate if anyone could tell me how I can resolve this?

 best,
 /Shahab





Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-05 Thread Akhil Das
It could be filling up your /tmp directory. You need to set your
spark.local.dir or you can also specify SPARK_WORKER_DIR to another
location which has sufficient space.
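
For example (the paths are placeholders):

    # in spark-defaults.conf, or via --conf on spark-submit
    spark.local.dir  /mnt/bigdisk/spark-tmp

    # or, for standalone workers, in spark-env.sh
    export SPARK_WORKER_DIR=/mnt/bigdisk/spark-work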

Thanks
Best Regards

On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am getting No space left on device exception when doing repartitioning
  of approx. 285 MB of data  while these is still 2 GB space left ??

 does it mean that repartitioning needs more space (more than 2 GB) for
 repartitioning of 285 MB of data ??

 best,
 /Shahab

 java.io.IOException: No space left on device
   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
   at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
   at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
   at sun.nio.ch.IOUtil.write(IOUtil.java:51)
   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
   at 
 sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473)
   at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569)
   at org.apache.spark.util.Utils$.copyStream(Utils.scala:331)
   at 
 org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
   at 
 org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)




Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread Tristan Blakers
Hi Shahab,

I’ve seen exceptions very similar to this (it also manifests as negative
array size exception), and I believe it’s really a bug in Kryo.

See this thread:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E

Manifests in all of the following situations when working with an object
graph in excess of a few GB: Joins, Broadcasts, and when using the hadoop
save APIs.

Tristan


On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote:

 Can you post your code, otherwise there's not much we can do.

 Regards,

 Olivier.

 On Sat, May 2, 2015 at 21:15, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I am using sprak-1.2.0 and I used Kryo serialization but I get the
 following excepton.

 java.io.IOException: com.esotericsoftware.kryo.KryoException:
 java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1

 I do apprecciate if anyone could tell me how I can resolve this?

 best,
 /Shahab




RE: Reply: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread Wang, Daoyuan
How did you configure your metastore?

Thanks,
Daoyuan

From: 鹰 [mailto:980548...@qq.com]
Sent: Tuesday, May 05, 2015 3:11 PM
To: luohui20001
Cc: user
Subject: Reply: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

hi luo,
thanks for your reply in fact I can use hive by spark on my  spark master 
machine, but when I copy my spark files to another machine  and when I want to 
access the hive by spark get the error Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient , I have copy 
hive-site.xml to spark conf directory and I have the authenticated to access 
hive metastore warehouse;

Thanks , Best regards!
-- Original message --
From: luohui20001 luohui20...@sina.com
Sent: Tuesday, May 5, 2015, 9:56 AM
To: 鹰 980548...@qq.com; user user@spark.apache.org
Subject: Reply: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

you may need to copy hive-site.xml to your spark conf directory and check your 
hive metastore warehouse setting, and also check if you are authenticated to 
access hive metastore warehouse.



Thanks & Best regards!
罗辉 San.Luo

- Original message -
From: 鹰 980548...@qq.com
To: user user@spark.apache.org
Subject: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Date: May 5, 2015, 08:49

hi all,
   when I submit a spark-sql program to select data from my hive 
database I get an error like this:
User class threw exception: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my 
spark configuration? Thanks for any help!


Reply: RE: Reply: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
my metastore is like this
 <property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.1.40:3306/hive</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive123!@#</value>
</property>

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/${user.name}/hive-warehouse</value>
<description>location of default database for the
warehouse</description>
</property>





------ Original message ------
From: Wang, Daoyuan daoyuan.w...@intel.com
Sent: Tuesday, May 5, 2015, 3:22 PM
To: 鹰 980548...@qq.com; luohui20001 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: RE: Reply: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient



  
How did you configure your metastore?
 
 
 
Thanks,
 
Daoyuan
 
 
 
From: 鹰 [mailto:980548...@qq.com] 
 Sent: Tuesday, May 05, 2015 3:11 PM
 To: luohui20001
 Cc: user
 Subject: Reply: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
 
 
hi luo,
 thanks for your reply in fact I can use hive by spark on my  spark master 
machine, but when I copy my spark files to another machine  and when I want to 
access the hive by spark get the error Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient  , I have copy 
hive-site.xml to spark conf directory and I have the authenticated to access 
hive metastore warehouse;
   
 
 
  
Thanks , Best regards!
 
  
------ Original message ------

From: luohui20001 luohui20...@sina.com

Sent: Tuesday, May 5, 2015, 9:56 AM

To: 鹰 980548...@qq.com; user user@spark.apache.org

Subject: Reply: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
 
  
 
 
 
you may need to copy hive-site.xml to your spark conf directory and check your 
hive metastore warehouse setting, and also check if you are authenticated to 
access hive metastore warehouse.
 
 
  

 
   
 
 
 
Thanks & Best regards!
  罗辉 San.Luo
 
 
 
   
- Original message -
From: 鹰 980548...@qq.com
To: user user@spark.apache.org
Subject: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Date: May 5, 2015, 08:49
 
 

 hi all,
when I submit a spark-sql program to select data from my hive 
database I get an error like this:
 User class threw exception: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my 
spark configuration? Thanks for any help!

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
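
For anyone else hitting this, a minimal sketch of what that looks like
(pairRdd and myPartitioner stand in for the RDD and partitioner from the
pseudocode above; the key type needs an Ordering):

    // shuffle with the custom partitioner and sort keys within each partition
    // in a single pass, instead of separate sortByKey + partitionBy steps
    val sorted = pairRdd.repartitionAndSortWithinPartitions(myPartitioner)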

On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com
wrote:

 Hi Imran,

 Yes that's what MyPartitioner does. I do see (using traces from
 MyPartitioner) that the key is partitioned on partition 0 but then I see
 this record arriving in both Yarn containers (I see it in the logs).
 Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey
 seemed a natural fit ( ... I am aware of its limitations).

 Thanks,
 Marius

 On Mon, May 4, 2015 at 10:45 PM Imran Rashid iras...@cloudera.com wrote:

 Hi Marius,

 I am also a little confused -- are you saying that myPartitions is
 basically something like:

 class MyPartitioner extends Partitioner {
   def numPartitions = 1
   def getPartition(key: Any) = 0
 }

 ??

 If so, I don't understand how you'd ever end up data in two partitions.
 Indeed, than everything before the call to partitionBy(myPartitioner) is
 somewhat irrelevant.  The important point is the partitionsBy should put
 all the data in one partition, and then the operations after that do not
 move data between partitions.  so if you're really observing data in two
 partitions, then it would good to know more about what version of spark you
 are on, your data etc. as it sounds like a bug.

 But, I have a feeling there is some misunderstanding about what your
 partitioner is doing.  Eg., I think doing groupByKey followed by sortByKey
 doesn't make a lot of sense -- in general one sortByKey is all you need
 (its not exactly the same, but most probably close enough, and avoids doing
 another expensive shuffle).  If you can share a bit more information on
 your partitioner, and what properties you need for your f, that might
 help.

 thanks,
 Imran


 On Tue, Apr 28, 2015 at 7:10 AM, Marius Danciu marius.dan...@gmail.com
 wrote:

 Hello all,

 I have the following Spark (pseudo)code:

 rdd = mapPartitionsWithIndex(...)
 .mapPartitionsToPair(...)
 .groupByKey()
 .sortByKey(comparator)
 .partitionBy(myPartitioner)
 .mapPartitionsWithIndex(...)
 .mapPartitionsToPair( *f* )

 The input data has 2 input splits (yarn 2.6.0).
 myPartitioner partitions all the records on partition 0, which is
 correct, so the intuition is that f provided to the last transformation
 (mapPartitionsToPair) would run sequentially inside a single yarn
 container. However from yarn logs I do see that both yarn containers are
 processing records from the same partition ... and *sometimes*  the
 overall job fails (due to the code in f which expects a certain order of
 records) and yarn container 1 receives the records as expected, whereas
 yarn container 2 receives a subset of records ... for a reason I cannot
 explain and f fails.

 The overall behavior of this job is that sometimes it succeeds and
 sometimes it fails ... apparently due to inconsistent propagation of sorted
 records to yarn containers.


 If any of this makes any sense to you, please let me know what I am
 missing.



 Best,
 Marius





Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Hi Imran,

Yes that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is partitioned on partition 0 but then I see
this record arriving in both Yarn containers (I see it in the logs).
Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey
seemed a natural fit ( ... I am aware of its limitations).

Thanks,
Marius

On Mon, May 4, 2015 at 10:45 PM Imran Rashid iras...@cloudera.com wrote:

 Hi Marius,

 I am also a little confused -- are you saying that myPartitions is
 basically something like:

 class MyPartitioner extends Partitioner {
   def numPartitions = 1
   def getPartition(key: Any) = 0
 }

 ??

 If so, I don't understand how you'd ever end up with data in two partitions.
 Indeed, then everything before the call to partitionBy(myPartitioner) is
 somewhat irrelevant.  The important point is the partitionsBy should put
 all the data in one partition, and then the operations after that do not
 move data between partitions.  so if you're really observing data in two
 partitions, then it would good to know more about what version of spark you
 are on, your data etc. as it sounds like a bug.

 But, I have a feeling there is some misunderstanding about what your
 partitioner is doing.  Eg., I think doing groupByKey followed by sortByKey
 doesn't make a lot of sense -- in general one sortByKey is all you need
 (its not exactly the same, but most probably close enough, and avoids doing
 another expensive shuffle).  If you can share a bit more information on
 your partitioner, and what properties you need for your f, that might
 help.

 thanks,
 Imran


 On Tue, Apr 28, 2015 at 7:10 AM, Marius Danciu marius.dan...@gmail.com
 wrote:

 Hello all,

 I have the following Spark (pseudo)code:

 rdd = mapPartitionsWithIndex(...)
 .mapPartitionsToPair(...)
 .groupByKey()
 .sortByKey(comparator)
 .partitionBy(myPartitioner)
 .mapPartitionsWithIndex(...)
 .mapPartitionsToPair( *f* )

 The input data has 2 input splits (yarn 2.6.0).
 myPartitioner partitions all the records on partition 0, which is
 correct, so the intuition is that f provided to the last transformation
 (mapPartitionsToPair) would run sequentially inside a single yarn
 container. However from yarn logs I do see that both yarn containers are
 processing records from the same partition ... and *sometimes* the overall
 job fails (due to the code in f which expects a certain order of
 records) and yarn container 1 receives the records as expected, whereas
 yarn container 2 receives a subset of records ... for a reason I cannot
 explain and f fails.

 The overall behavior of this job is that sometimes it succeeds and
 sometimes it fails ... apparently due to inconsistent propagation of sorted
 records to yarn containers.


 If any of this makes any sense to you, please let me know what I am
 missing.



 Best,
 Marius





Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread ??
hi luo,
thanks for your reply. In fact I can use Hive from Spark on my Spark master
machine, but when I copy my Spark files to another machine and try to access
Hive from Spark I get the error "Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied
hive-site.xml to the Spark conf directory and I am authenticated to access
the Hive metastore warehouse.


Thanks, Best regards!

------------------ Original message ------------------
From: luohui20001 <luohui20...@sina.com>
Sent: Tuesday, May 5, 2015, 9:56
To: ?? <980548...@qq.com>; user <user@spark.apache.org>
Subject: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

you may need to copy hive-site.xml to your spark conf directory and check your
hive metastore warehouse setting, and also check if you are authenticated to
access hive metastore warehouse.


Thanks & Best regards!
San.Luo

----- Original message -----
From: ?? <980548...@qq.com>
To: user <user@spark.apache.org>
Subject: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Date: 2015-05-05 08:49

hi all,
when I submit a spark-sql program to select data from my Hive database I get
an error like this:
User class threw exception: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my
Spark configuration? Thanks for any help!

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread ??
thanks jeanlyn, it works




------------------ Original message ------------------
From: jeanlyn <oujianl...@jd.com>
Sent: Tuesday, May 5, 2015, 3:40 PM
To: ?? <980548...@qq.com>
Cc: Wang, Daoyuan <daoyuan.w...@intel.com>; user <user@spark.apache.org>
Subject: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient


Have you configured SPARK_CLASSPATH with the MySQL connector jar in spark-env.sh? For
example: export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar
On May 5, 2015, at 3:32 PM, 980548...@qq.com wrote:

my metastore is like this:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.1.40:3306/hive</value>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123!@#</value>
</property>

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/${user.name}/hive-warehouse</value>
    <description>location of default database for the warehouse</description>
</property>





------------------ Original message ------------------
From: Wang, Daoyuan <daoyuan.w...@intel.com>
Sent: Tuesday, May 5, 2015, 3:22 PM
To: ?? <980548...@qq.com>; luohui20001 <luohui20...@sina.com>
Cc: user <user@spark.apache.org>
Subject: RE: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient




How did you configure your metastore?

 

Thanks,

Daoyuan

 

From: ?? [mailto:980548...@qq.com] 
Sent: Tuesday, May 05, 2015 3:11 PM
To: luohui20001
Cc: user
Subject: Re: spark Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

 

hi luo,
thanks for your reply. In fact I can use Hive from Spark on my Spark master
machine, but when I copy my Spark files to another machine and try to access
Hive from Spark I get the error "Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied
hive-site.xml to the Spark conf directory and I am authenticated to access
the Hive metastore warehouse.


Thanks, Best regards!

------------------ Original message ------------------
From: luohui20001 <luohui20...@sina.com>
Sent: Tuesday, May 5, 2015, 9:56
To: ?? <980548...@qq.com>; user <user@spark.apache.org>
Subject: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient

you may need to copy hive-site.xml to your spark conf directory and check your
hive metastore warehouse setting, and also check if you are authenticated to
access hive metastore warehouse.


Thanks & Best regards!
San.Luo

----- Original message -----
From: ?? <980548...@qq.com>
To: user <user@spark.apache.org>
Subject: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Date: 2015-05-05 08:49

hi all,
when I submit a spark-sql program to select data from my Hive database I get
an error like this:
User class threw exception: java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my
Spark configuration? Thanks for any help!

RE: Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi ,

I am using Spark 1.3.0.

I was able to join a JSON file on HDFS registered as a TempTable with a table 
in MySQL. On the same lines I tried to join a table in Hive with another table 
in Teradata but I get a query parse exception.

Regards,
Ishwardeep
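
A rough sketch of one way to line this up on 1.3, in case it helps others: load
the Teradata table through the generic JDBC source, register it as a temp table
in the same HiveContext that already sees the Hive metastore, and join in one
SQL statement. The URLs, table and column names below are invented, and the
Teradata JDBC driver jar is assumed to be on the driver and executor classpath.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)   // sc is an existing SparkContext

// Load the Teradata table through the generic JDBC source.
val tdOrders = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:teradata://td-host/DATABASE=sales",   // placeholder URL
  "dbtable" -> "sales.orders"))
tdOrders.registerTempTable("td_orders")

// Both sides now resolve in one context and can be joined with plain SQL.
val joined = sqlContext.sql(
  "SELECT h.customer_id, o.amount FROM hive_customers h " +
  "JOIN td_orders o ON h.customer_id = o.customer_id")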


From: ankitjindal [via Apache Spark User List] 
[mailto:ml-node+s1001560n22762...@n3.nabble.com]
Sent: Tuesday, May 5, 2015 1:26 PM
To: Ishwardeep Singh
Subject: Re: Unable to join table across data sources using sparkSQL

Hi

I was doing the same, but with a file in Hadoop as a temp table and one
table in SQL Server, and I succeeded with it.

Which spark version are you using currently?

Thanks
Ankit















Re: Event generator for SPARK-Streaming from csv

2015-05-05 Thread anshu shukla
I know these methods, but I need to create events using the timestamps in
the data tuples, meaning every new tuple is generated according to the
timestamp in the CSV file. This is useful to simulate the data rate over
time, just like real sensor data.

On Fri, May 1, 2015 at 2:52 PM, Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com wrote:

 Hi,

 Maybe you could use streamingContext.fileStream like in the example from
 https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers,
 you can read from files on any file system compatible with the HDFS API
 (that is, HDFS, S3, NFS, etc.). You could split the file into several
 smaller files, and move them to the target folder one by one with some
 sleep time in between to simulate a stream of data with custom granularity.

 Hope that helps,

 Greetings,

 Juan

 2015-05-01 9:30 GMT+02:00 anshu shukla anshushuk...@gmail.com:





 I have the real DEBS taxi data in a CSV file. In order to operate over it,
 how do I simulate a Spout-like event generator using the
 timestamps in the CSV file?




 --
 Thanks  Regards,
 Anshu Shukla





-- 
Thanks  Regards,
Anshu Shukla
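
Not a Spark API answer, but a sketch of the kind of standalone replayer this
usually ends up being: it assumes the first CSV column is an epoch-millisecond
timestamp, sleeps for the gap between consecutive events, and drops one small
file per event into a directory that a StreamingContext watches with
textFileStream. File names, paths and the column index are made up.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.io.Source

object CsvEventReplayer {
  def main(args: Array[String]): Unit = {
    val sourceCsv = "debs-taxi.csv"        // hypothetical input file
    val targetDir = "/tmp/stream-input"    // directory watched by ssc.textFileStream(targetDir)
    Files.createDirectories(Paths.get(targetDir))

    var previousTs: Option[Long] = None
    Source.fromFile(sourceCsv).getLines().zipWithIndex.foreach { case (line, i) =>
      val ts = line.split(",")(0).trim.toLong          // assumed epoch millis in column 0
      previousTs.foreach(prev => Thread.sleep(math.max(0L, ts - prev)))   // reproduce the original rate
      previousTs = Some(ts)

      // textFileStream ignores dot-files, so write under a hidden name first and then
      // rename, which makes the file appear atomically to the stream.
      val tmp   = Paths.get(targetDir, s".event-$i.tmp")
      val ready = Paths.get(targetDir, s"event-$i.csv")
      Files.write(tmp, line.getBytes(StandardCharsets.UTF_8))
      Files.move(tmp, ready)
    }
  }
}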


Maximum Core Utilization

2015-05-05 Thread Manu Kaul
Hi All,
For a job I am running on Spark with a dataset of say 350,000 lines (not
big), I am finding that even though my cluster has a large number of cores
available (like 100 cores), the Spark system seems to stop after using just
4 cores and after that the runtime is pretty much a straight line no matter
how many more cores are thrown at it. I am wondering if Spark tries to
figure out the maximum no. of cores to use based on the size of the
dataset? If yes, is there a way to disable this feature and force it to use
all the cores available?

Thanks,
Manu

-- 

The greater danger for most of us lies not in setting our aim too high and
falling short; but in setting our aim too low, and achieving our mark.
- Michelangelo


RE: Remoting warning when submitting to cluster

2015-05-05 Thread Javier Delgadillo
I downloaded the 1.3.1 source distribution and built on Windows (laptop 8.0 and 
desktop 8.1)

Here’s what I’m running:
Desktop:
Spark Master (%SPARK_HOME%\bin\spark-class2.cmd 
org.apache.spark.deploy.master.Master -h desktop --port 7077)
Spark Worker (%SPARK_HOME%\bin\spark-class2.cmd 
org.apache.spark.deploy.worker.Worker spark://desktop:7077)
Kafka Broker
ZooKeeper Server

Laptop:
2 Kafka Producers each sending to a unique topic to broker running on Desktop
Driver App

In this scenario, I get no messages showing up in the Driver App’s console.  If 
on the other hand, I either move the driver app to the desktop or run the 
worker on the laptop instead of the desktop, then I see the counts as expected 
(meaning the driver and the worker/executor are on the same machine).

When I moved this scenario to a set of machines in a separate network, 
separating the executor and driver worked as expected. So it seems a networking 
issue was causing the failure.

Now to the followup question:  which property do I set to configure the port so 
that I can ensure it’s a port that isn’t blocked by Systems?

The candidates I see:
spark.blockManager.port
spark.driver.host
spark.driver.port
spark.executor.port


-Javier
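
For what it's worth, a sketch of pinning the otherwise randomly chosen ports so
a firewall rule can be written for them; the port numbers are placeholders and
the property names are the standard Spark 1.3 networking settings.
spark.driver.host matters too when the driver machine has several addresses: it
should be a hostname the workers can actually reach.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-counter")
  .set("spark.driver.host", "laptop-hostname")   // placeholder; an address reachable from the workers
  .set("spark.driver.port", "51000")             // executors connect back to the driver on this port
  .set("spark.blockManager.port", "51001")       // block manager port on driver and executors
  .set("spark.executor.port", "51002")           // executor-side akka port
// pass this conf to the StreamingContext as usual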

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 4, 2015 12:42 AM
To: Javier Delgadillo
Cc: user@spark.apache.org
Subject: Re: Remoting warning when submitting to cluster

Looks like a version incompatibility, just make sure you have the proper 
version of spark. Also look further in the stacktrace what is causing Futures 
timed out (it could be a network issue also if the ports aren't opened properly)

Thanks
Best Regards

On Sat, May 2, 2015 at 12:04 AM, javidelgadillo
jdelgadi...@esri.com wrote:
Hello all!!

We've been prototyping some spark applications to read messages from Kafka
topics.  The application is quite simple, we use KafkaUtils.createStream to
receive a stream of CSV messages from a Kafka Topic.  We parse the CSV and
count the number of messages we get in each RDD. At a high-level (removing
the abstractions of our appliction), it looks like this:

val sc = new SparkConf()
  .setAppName(appName)
  .set("spark.executor.memory", "1024m")
  .set("spark.cores.max", "3")
  .set("spark.app.name", appName)
  .set("spark.ui.port", sparkUIPort)

val ssc = new StreamingContext(sc, Milliseconds(emitInterval.toInt))

KafkaUtils
  .createStream(ssc, zookeeperQuorum, consumerGroup, topicMap)
  .map(_._2)
  .foreachRDD( (rdd: RDD[String], time: Time) => {
    println("Time %s: (%s total records)".format(time, rdd.count()))
  })

When I submit this with the Spark master set to local[3], everything behaves as
I'd expect. After some startup overhead, I'm seeing the printed count match
the count I'm simulating (1 every second, for example).

When I submit this to a spark master using spark://master.host:7077, the
behavior is different. The overhead before it starts receiving seems longer, and on
some runs I don't see anything for 30 seconds even though my simulator is
sending messages to the topic.  I also see the following error written to
stderr by every executor assigned to the job:

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
15/05/01 10:11:38 INFO SecurityManager: Changing view acls to: username
15/05/01 10:11:38 INFO SecurityManager: Changing modify acls to: username
15/05/01 10:11:38 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(javi4211);
users with modify permissions: Set(username)
15/05/01 10:11:38 INFO Slf4jLogger: Slf4jLogger started
15/05/01 10:11:38 INFO Remoting: Starting remoting
15/05/01 10:11:39 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://driverpropsfetc...@master.host:56534]
15/05/01 10:11:39 INFO Utils: Successfully started service
'driverPropsFetcher' on port 56534.
15/05/01 10:11:40 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkdri...@driver.host:51837]. Address is now gated for
5000 ms, all messages to this address will be delivered to dead letters.
Reason: Connection refused: no further information:
driver.host/10.27.51.214:51837
15/05/01 10:12:09 ERROR UserGroupInformation: PriviledgedActionException
as:username cause:java.util.concurrent.TimeoutException: Futures timed out
after [30 seconds]
Exception in thread main java.lang.reflect.UndeclaredThrowableException:
Unknown exception in doAs
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128)
at

Re: Maximum Core Utilization

2015-05-05 Thread Richard Marscher
Hi,

do you have information on how many partitions/tasks the stage/job is
running? By default there is 1 core per task, and your number of concurrent
tasks may be limiting core utilization.

There are a few settings you could play with, assuming your issue is
related to the above (a short sketch follows the list):
spark.default.parallelism
spark.cores.max
spark.task.cpus
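
A minimal sketch of the first two knobs (the numbers and input path are
placeholders); a 350,000-line text file often arrives as only a handful of
input splits, so asking for more partitions up front, or repartitioning before
the expensive stage, is usually what lets the extra cores do work:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("use-more-cores")
  .set("spark.cores.max", "100")            // upper bound on cores for this app
  .set("spark.default.parallelism", "200")  // default number of partitions after shuffles
val sc = new SparkContext(conf)

val data = sc.textFile("hdfs:///data/input.txt", 200)   // minPartitions hint at load time
val widened = data.repartition(200)                     // or widen an existing, too-narrow RDD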

On Tue, May 5, 2015 at 3:55 PM, Manu Kaul manohar.k...@gmail.com wrote:

 Hi All,
 For a job I am running on Spark with a dataset of say 350,000 lines (not
 big), I am finding that even though my cluster has a large number of cores
 available (like 100 cores), the Spark system seems to stop after using just
 4 cores and after that the runtime is pretty much a straight line no matter
 how many more cores are thrown at it. I am wondering if Spark tries to
 figure out the maximum no. of cores to use based on the size of the
 dataset? If yes, is there a way to disable this feature and force it to use
 all the cores available?

 Thanks,
 Manu

 --

 The greater danger for most of us lies not in setting our aim too high and
 falling short; but in setting our aim too low, and achieving our mark.
 - Michelangelo



Multilabel Classification in spark

2015-05-05 Thread peterg
Hi all,

I'm looking to implement a Multilabel classification algorithm but I am
surprised to find that there are not any in the spark-mllib core library. Am
I missing something? Would someone point me in the right direction?

Thanks!

Peter  






Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in the MLlib package supports multilabel classification.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
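
One caveat worth adding: what MLlib's LogisticRegressionWithLBFGS gives you is
multiclass (multinomial) classification, i.e. exactly one of k labels per
example; genuine multi-label output (several labels per example) is usually
built as one binary model per label on top of it. A small sketch with made-up
data, assuming sc is an existing SparkContext:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.1)),
  LabeledPoint(1.0, Vectors.dense(0.2, 1.3)),
  LabeledPoint(2.0, Vectors.dense(2.1, 2.0))))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(3)                 // k classes, with labels 0.0 .. k-1
  .run(training)

val predicted = model.predict(Vectors.dense(0.3, 1.1))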


On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote:
 Hi all,

 I'm looking to implement a Multilabel classification algorithm but I am
 surprised to find that there are not any in the spark-mllib core library. Am
 I missing something? Would someone point me in the right direction?

 Thanks!

 Peter






Help with datetime comparison in SparkSQL statement ...

2015-05-05 Thread subscripti...@prismalytics.io

Hello Friends:

Here's sample output from a SparkSQL query that works, just so you can see the
underlying data structure; followed by one that fails.

 # Just so you can see the DataFrame structure ...

 resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE trip_time_in_secs = 3780")

 resultsRDD.collect() # WORKS.
[Row(pickup_datetime=datetime.datetime(2013, 10, 4, 8, 0),
 dropoff_datetime=datetime.datetime(2013, 10, 4, 9, 3),
 trip_time_in_secs=3780,
 trip_distance=17.10381469727),

 Row(pickup_datetime=datetime.datetime(2013, 10, 18, 8, 0),
 dropoff_datetime=datetime.datetime(2013, 10, 18, 9, 3),
 trip_time_in_secs=3780,
 trip_distance=17.92076293945), ... )

But the following SQL experiences the exception shown below.
 resultsRDD = sqlCtx.sql(SELECT * FROM rides WHERE pickup_datetime 
 datetime.datetime(2013,12,1,0,0,0))

 resultsRDD.collect() # FAILS.

   py4j.protocol.Py4JJavaError: An error occurred while calling o53.sql.
   java.lang.RuntimeException: [1.62] failure: ``union'' expected but `(' found


It's the first time I'm trying this and seemingly doing it incorrectly.
Can anyone show me how to correct this?

Thank you! =:)
nmv
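
The failing query embeds a Python expression (datetime.datetime(...)) inside
the SQL string; the parser only ever sees that text, which is what the
``union'' expected but `(' found message is about. Writing the cutoff as a
plain literal should get past it. A sketch (shown in Scala, but the SQL string
is the part that matters and can be passed to sqlCtx.sql() from PySpark the
same way; the comparison direction is a guess):

val results = sqlContext.sql(
  "SELECT * FROM rides WHERE pickup_datetime >= '2013-12-01 00:00:00'")
results.collect()

If the plain SQLContext parser is still unhappy, the same comparison can be
expressed with the DataFrame API, or the table can be queried through a
HiveContext, whose parser accepts a wider range of expressions.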

--
Sincerely yours,
Team PRISMALYTICS

PRISMALYTICS, LLC. | www.prismalytics.com
P: 212.882.1276 | subscripti...@prismalytics.io
Follow Us: https://www.LinkedIn.com/company/prismalytics

data analytics to literally count on


example code for current date in spark sql

2015-05-05 Thread kiran mavatoor
Hi,
In Hive, I am using unix_timestamp() as 'update_on' to insert the current date into
the 'update_on' column of the table. Now I am converting it to Spark SQL. Please
suggest example code to insert the current date and time into a column of the table
using Spark SQL.
Cheers
Kiran.
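
Two hedged sketches, with placeholder table and column names and an existing
SparkContext sc assumed. With a HiveContext the Hive UDF unix_timestamp() keeps
working inside the SQL text; alternatively the value can be computed on the
driver and spliced into the statement:

import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)

// 1) Keep using the Hive UDF in the query text.
hiveCtx.sql("INSERT INTO TABLE target_table " +
  "SELECT name, amount, unix_timestamp() AS update_on FROM staging_table")

// 2) Or compute the epoch seconds on the driver and interpolate it.
val updateOn = System.currentTimeMillis / 1000
hiveCtx.sql(s"INSERT INTO TABLE target_table " +
  s"SELECT name, amount, $updateOn AS update_on FROM staging_table")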

setting spark configuration properties problem

2015-05-05 Thread Hafiz Mujadid
Hi all,

I have declared the Spark context at the start of my program and then I want to
change its configuration at some later stage in my code, as written below
val conf = new SparkConf().setAppName("Cassandra Demo")
var sc: SparkContext = new SparkContext(conf)
sc.getConf.set("spark.cassandra.connection.host", host)
println(sc.getConf.get("spark.cassandra.connection.host"))

in the above code the following exception occurs

Exception in thread "main" java.util.NoSuchElementException:
spark.cassandra.connection.host

Actually I want to set the Cassandra host property only if there is some job
related to Cassandra, so I declare the SparkContext earlier and then want to
set this property at some later stage.

Any suggestion?
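
sc.getConf hands back a copy of the configuration, which is why the set() is
lost and the subsequent get() throws. Properties have to be on the SparkConf
before the SparkContext is constructed, so the usual pattern is to decide them
up front, or to delay creating the context until the decision is known. A
sketch, where needsCassandra and host are hypothetical values known at startup:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Cassandra Demo")
if (needsCassandra)                                    // hypothetical flag
  conf.set("spark.cassandra.connection.host", host)

val sc = new SparkContext(conf)                        // settings are frozen from here on
println(sc.getConf.get("spark.cassandra.connection.host", "not set"))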






JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi,

how important is JAVA for Spark certification? Will learning only Python
and Scala not work?


Regards,
Gourav


Map one RDD into two RDD

2015-05-05 Thread Bill Q
Hi all,
I have a large RDD that I map a function over. Based on the nature of each
record in the input RDD, I will generate two types of data. I would like to
save each type into its own RDD. But I can't seem to find an efficient way
to do it. Any suggestions?

Many thanks.


Bill


-- 
Many thanks.


Bill


Re: Map one RDD into two RDD

2015-05-05 Thread Ted Yu
Have you looked at RDD#randomSplit() (as example) ?

Cheers
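
randomSplit divides the data at random rather than by content, so for splitting
on the nature of each record the usual pattern is to cache the mapped RDD and
filter it twice. A sketch with a made-up rule (the first CSV field decides the
type):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-by-type"))
val mapped = sc.textFile("hdfs:///in").map(_.split(",")).cache()   // cache so both filters reuse it

val typeA = mapped.filter(fields => fields(0) == "A")
val typeB = mapped.filter(fields => fields(0) != "A")

typeA.map(_.mkString(",")).saveAsTextFile("hdfs:///out/typeA")
typeB.map(_.mkString(",")).saveAsTextFile("hdfs:///out/typeB")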

On Tue, May 5, 2015 at 2:42 PM, Bill Q bill.q@gmail.com wrote:

 Hi all,
 I have a large RDD that I map a function to it. Based on the nature of
 each record in the input RDD, I will generate two types of data. I would
 like to save each type into its own RDD. But I can't seem to find an
 efficient way to do it. Any suggestions?

 Many thanks.


 Bill


 --
 Many thanks.


 Bill




Configuring Number of Nodes with Standalone Scheduler

2015-05-05 Thread Nastooh Avessta (navesta)
Hi
I have a 1.0.0 cluster with multiple worker nodes that deploy a number of 
external tasks through getRuntime().exec. Currently I have no control over how
many nodes get used for a given task. At times the scheduler evenly distributes
the executors among all nodes and at other times it only uses one node. (The
difficulty with the latter is that the deployed tasks run out of memory, at
which point the kernel intervenes and kills them.) I've tried setting
spark.cores.max to the available number of cores, spark.deploy.spreadOut to true,
spark.scheduler.mode to FAIR, etc., to no avail. Is there an undocumented
parameter or a priming procedure to do this?
Cheers,


Nastooh Avessta
ENGINEER.SOFTWARE ENGINEERING
nave...@cisco.com
Phone: +1 604 647 1527

Cisco Systems Limited
595 Burrard Street, Suite 2123 Three Bentall Centre, PO Box 49121
VANCOUVER
BRITISH COLUMBIA
V7X 1J1
CA
Cisco.com








Re: JAVA for SPARK certification

2015-05-05 Thread Kartik Mehta
I too have similar question.

My understanding is that since Spark is written in Scala, having done it in Scala
will be OK for certification.

If someone who has done certification can confirm.

Thanks,

Kartik
On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only Python
 and Scala not work?


 Regards,
 Gourav



Re: JAVA for SPARK certification

2015-05-05 Thread Stephen Boesch
There are questions in all three languages.

2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:

 I too have similar question.

 My understanding is that since Spark is written in Scala, having done it in Scala
 will be OK for certification.

 If someone who has done certification can confirm.

 Thanks,

 Kartik
 On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com
 wrote:

 Hi,

 how important is JAVA for Spark certification? Will learning only Python
 and Scala not work?


 Regards,
 Gourav




spark sql, creating literal columns in java.

2015-05-05 Thread Jan-Paul Bultmann
Hey,
What is the recommended way to create literal columns in java?
Scala has the `lit` function from  `org.apache.spark.sql.functions`.
Should it be called from java as well?

Cheers jan
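
In Scala it is just the functions object; because that compiles down to static
methods, the expectation (not verified here) is that Java code can do a static
import of org.apache.spark.sql.functions and call lit(...) the same way. A
Scala sketch, assuming df is an existing DataFrame with an id column:

import org.apache.spark.sql.functions.lit

val withConstant = df.withColumn("source", lit("kafka"))   // constant column on every row
val tagged = df.select(df("id"), lit(1).as("flag"))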




Re: RDD coalesce or repartition by #records or #bytes?

2015-05-05 Thread Du Li
Hi, Spark experts:

I did rdd.coalesce(numPartitions).saveAsSequenceFile(dir) in my code, which
generates the rdd's in streamed batches. It generates numPartitions files as
expected, with names dir/part-x. However, the first couple of files (e.g.,
part-0, part-1) have many times more records than the other files. This
highly skewed distribution causes stragglers (and hence unpredictable execution
time) and also the need to allocate memory for the worst case (because some of
the records can be much larger than average).

To solve this problem, I replaced coalesce(numPartitions) with
repartition(numPartitions) or coalesce(numPartitions, shuffle=true), which are
equivalent. As a result, the records are more evenly distributed over the
output files and the execution time becomes more predictable. It of course
incurs a lot of shuffle traffic. However, the GC time became prohibitively
high, which crashed my app in just a few hours. Adding more memory to executors
didn't seem to help.

Do you have any suggestions on how to spread the data without the GC costs?
Does repartition() redistribute/shuffle every record by hash partitioner? Why
does it drive the GC time so high?

Thanks,
Du
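
For reference, the two variants discussed, with placeholder paths and partition
counts; coalesce without a shuffle only merges existing partitions (so input
skew carries through), while repartition, i.e. coalesce with shuffle = true,
moves individual records and evens things out at the cost of shuffling
everything:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-sketch"))
val rdd = sc.textFile("hdfs:///in").map(line => (line.take(8), line))   // placeholder key/value pairs

val numPartitions = 64
val skewed = rdd.coalesce(numPartitions)                   // no shuffle: keeps the input skew
val even   = rdd.coalesce(numPartitions, shuffle = true)   // same as rdd.repartition(numPartitions)
even.saveAsSequenceFile("hdfs:///out/batch")

The GC side is more of a tuning question (serializer, record sizes, executor
heap) than an API one, so the sketch only shows the partitioning calls.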


 On Wednesday, March 4, 2015 5:39 PM, Zhan Zhang zzh...@hortonworks.com 
wrote:
   

 It uses a HashPartitioner to distribute the records to different partitions, but
the key is just an integer distributed evenly across output partitions.
From the code, each resulting partition will get a very similar number of
records.

Thanks.
Zhan Zhang

On Mar 4, 2015, at 3:47 PM, Du Li l...@yahoo-inc.com.INVALID wrote:

Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data will be processed on a set of machines
in parallel as evenly as possible. The number of processing nodes is larger than
the number of receiving nodes.
My question is how coalesce/repartition works. Does it distribute by the
number of records or the number of bytes? In my app, my observation is that the
distribution seems to be by number of records. The consequence is, however, that
some executors have to process x1000 as much data when the sizes of records are
very skewed. Then we have to allocate memory for the worst case.
Is there a way to programmatically affect the coalesce/repartition scheme?
Thanks,
Du