Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Todd
Thanks Hao for the reply.
I turned the sort merge join off; the physical plan is below, but the performance
is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true








At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:


It is not a big surprise that SMJ is slower than the HashJoin, as we do not
fully utilize the sorting yet; more details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

 

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again?
In our previous testing, sort merge join was about 20% slower. I am not sure if
anything else is slowing down the performance.
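For example, something like this before running the query (just a sketch; the
same setting can also be issued as a SET command in the spark-sql CLI):

// Disable sort merge join so the planner falls back to the shuffled hash join.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")
// Equivalent form inside spark-shell / spark-sql:
sqlContext.sql("SET spark.sql.planner.sortMergeJoin=false")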

 

Hao

 

 

From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL

 

Could this be a build issue (i.e., sbt package)?

When I run the same jar built for 1.4.1 on 1.5, I see a large regression in
queries too (all other things identical)...

I am curious: to build 1.5 (when it isn't released yet), what do I need to do
with the build.sbt file?

Any special parameters I should be using to make sure I load the latest Hive
dependencies?


From: Michael Armbrust 
To: Todd 
Cc: "user@spark.apache.org" 
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL




I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4 with only small exceptions.  A few thoughts,

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.
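For reference, a quick way to capture that output on both versions (a rough
sketch; sql stands for the query string from the quoted message below):

// Prints the logical and physical plans so the 1.4.1 and 1.5.0 runs can be
// compared side by side.
val df = sqlContext.sql(sql)
df.explain(true)   // extended = true also prints the analyzed/optimized plans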

On Thu, Sep 10, 2015 at 1:24 AM, Todd  wrote:

Hi,

I am using data generated with
spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark
SQL performance (Spark on YARN, with 10 nodes) with the following code (the
table store_sales is about 90 million records, 6G in size):
 
val outputDir="hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name="store_sales"
sqlContext.sql(
  s"""
  |CREATE TEMPORARY TABLE ${name}
  |USING org.apache.spark.sql.parquet
  |OPTIONS (
  |  path '${outputDir}'
  |)
""".stripMargin)

val sql="""
 |select
 |  t1.ss_quantity,
 |  t1.ss_list_price,
 |  t1.ss_coupon_amt,
 |  t1.ss_cdemo_sk,
 |  t1.ss_item_sk,
 |  t1.ss_promo_sk,
 |  t1.ss_sold_date_sk
 |from store_sales t1 join store_sales t2 on t1.ss_item_sk = 
t2.ss_item_sk
 |where
 |  t1.ss_sold_date_sk between 2450815 and 2451179
   """.stripMargin

val df = sqlContext.sql(sql)
df.rdd.foreach(row=>Unit)

With 1.4.1, I can finish the query in 6 minutes,  but  I need 10+ minutes with 
1.5.

The configurations are basically the same, since I copied the configuration from
1.4.1 to 1.5:

sparkVersion                                   1.4.1   1.5.0
scaleFactor                                    30      30
spark.sql.shuffle.partitions                   600     600
spark.sql.sources.partitionDiscovery.enabled   true    true
spark.default.parallelism                      200     200
spark.driver.memory                            4G      4G
spark.executor.memory                          4G      4G
spark.executor.instances                       10      10
spark.shuffle.consolidateFiles                 true    true
spark.storage.memoryFraction                   0.4     0.4
spark.executor.cores                           3       3

I am not sure where it is going wrong, any ideas?

 

RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
You mean the performance is still as slow as with SMJ in Spark 1.5?

Can you set spark.shuffle.reduceLocality.enabled=false when you start
spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
default, but we found it can cause performance to drop dramatically.
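Since this is a core Spark setting (not a Spark SQL one), it has to go into the
SparkConf when the context is created; roughly something like this, or pass it
with --conf on the spark-shell/spark-sql command line:

import org.apache.spark.{SparkConf, SparkContext}

// Disable the new reduce-task locality preference introduced in 1.5.
// The app name is only illustrative.
val conf = new SparkConf()
  .setAppName("store_sales-join-test")
  .set("spark.shuffle.reduceLocality.enabled", "false")
val sc = new SparkContext(conf)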


From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

Thanks Hao for the reply.
I turned the sort merge join off; the physical plan is below, but the performance
is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true




At 2015-09-11 13:48:23, "Cheng, Hao" wrote:

It is not a big surprise that SMJ is slower than the HashJoin, as we do not
fully utilize the sorting yet; more details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again?
In our previous testing, sort merge join was about 20% slower. I am not sure if
anything else is slowing down the performance.

Hao


From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL


Could this be a build issue (i.e., sbt package)?

When I run the same jar built for 1.4.1 on 1.5, I see a large regression in
queries too (all other things identical)...

I am curious: to build 1.5 (when it isn't released yet), what do I need to do
with the build.sbt file?

Any special parameters I should be using to make sure I load the latest Hive
dependencies?


From: Michael Armbrust
To: Todd
Cc: "user@spark.apache.org"
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL





I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4 with only small exceptions.  A few thoughts,

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.

On Thu, Sep 10, 2015 at 1:24 AM, Todd wrote:
Hi,

I am using data generated with
spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark
SQL performance (Spark on YARN, with 10 nodes) with the following code (the
table store_sales is about 90 million records, 6G in size):

val outputDir="hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name="store_sales"
sqlContext.sql(
  s"""
  |CREATE TEMPORARY TABLE ${name}
  |USING org.apache.spark.sql.parquet
  |OPTIONS (
  |  path '${outputDir}'
  |)
""".stripMargin)

val sql="""
 |select
 |  t1.ss_quantity,
 |  t1.ss_list_price,
 |  t1.ss_coupon_amt,
 |  t1.ss_cdemo_sk,
 |  t1.ss_item_sk,
 |  t1.ss_promo_sk,
 |  t1.ss_sold_date_sk
 |from store_sales t1 join store_sales t2 on t1.ss_item_sk = 
t2.ss_item_sk
 |where
 |  t1.ss_sold_date_sk between 2450815 and 2451179
   """.stripMargin

val df = sqlContext.sql(sql)
df.rdd.foreach(row=>Unit)

With 1.4.1, I can finish the query in 6 minutes,  but  I need 10+ minutes with 
1.5.

The configurations are basically the same, since I copied the configuration from
1.4.1 to 1.5:

sparkVersion    1.4.1   1.5.0
scaleFactor

Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Todd

Yes...

At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:


You mean the performance is still as slow as with SMJ in Spark 1.5?

Can you set spark.shuffle.reduceLocality.enabled=false when you start
spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
default, but we found it can cause performance to drop dramatically.

 

 

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

 

Thanks Hao for the reply.
I turned the sort merge join off; the physical plan is below, but the performance
is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true







At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:



It is not a big surprise that SMJ is slower than the HashJoin, as we do not
fully utilize the sorting yet; more details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

 

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again?
In our previous testing, sort merge join was about 20% slower. I am not sure if
anything else is slowing down the performance.

 

Hao

 

 

From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL

 

Could this be a build issue (i.e., sbt package)?

When I run the same jar built for 1.4.1 on 1.5, I see a large regression in
queries too (all other things identical)...

I am curious: to build 1.5 (when it isn't released yet), what do I need to do
with the build.sbt file?

Any special parameters I should be using to make sure I load the latest Hive
dependencies?


From: Michael Armbrust 
To: Todd 
Cc: "user@spark.apache.org" 
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL




I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4 with only small exceptions.  A few thoughts,

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.

On Thu, Sep 10, 2015 at 1:24 AM, Todd  wrote:

Hi,

I am using data generated with
spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark
SQL performance (Spark on YARN, with 10 nodes) with the following code (the
table store_sales is about 90 million records, 6G in size):
 
val outputDir="hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name="store_sales"
sqlContext.sql(
  s"""
  |CREATE TEMPORARY TABLE ${name}
  |USING org.apache.spark.sql.parquet
  |OPTIONS (
  |  path '${outputDir}'
  |)
""".stripMargin)

val sql="""
 |select
 |  t1.ss_quantity,
 |  t1.ss_list_price,
 |  t1.ss_coupon_amt,
 |  t1.ss_cdemo_sk,
 |  t1.ss_item_sk,
 |  t1.ss_promo_sk,
 |  t1.ss_sold_date_sk
 |from store_sales t1 join store_sales t2 on t1.ss_item_sk = 
t2.ss_item_sk
 |where
 |  t1.ss_sold_date_sk between 2450815 and 2451179
   """.stripMargin

val df = sqlContext.sql(sql)
df.rdd.foreach(row=>Unit)

With 1.4.1, I can finish the query in 6 minutes,  but  I need 10+ minutes with 
1.5.

The configurations are basically the same, since I copied the configuration from
1.4.1 to 1.5:

sparkVersion                                   1.4.1   1.5.0
scaleFactor                                    30      30
spark.sql.shuffle.partitions                   600     600
spark.sql.sources.partitionDiscovery.enabled   true    true
spark.default.parallelism                      200     200
spark.driver.memory                            4G      4G
spark.executor.memory                          4G      4G

Data lost in spark streaming

2015-09-11 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data
it has received. But today the historical data in the DStream seems to have
been lost suddenly, and the application UI also lost the streaming processing
times and all the related data. Could anyone give some hints on how to debug
this? Thanks.


Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
A common practice to do this is to use foreachRDD with a local var to
accumulate the data (you can see it in the Spark Streaming test code).

That being said, I am a little curious why you want the driver to create
the file specifically.
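A rough Scala sketch of that pattern (not tested; the original question uses
the Java API, and the output directory below is just a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.dstream.DStream

def writeBatchesFromDriver(unified: DStream[(String, String)], dir: String): Unit = {
  unified.foreachRDD { (rdd, time) =>
    val batch = rdd.collect()               // the whole batch lands on the driver
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path(s"$dir/batch-${time.milliseconds}.txt"))
    try batch.foreach { case (k, v) => out.write(s"$k\t$v\n".getBytes("UTF-8")) }
    finally out.close()
  }
}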

On Friday, September 11, 2015, allonsy wrote:

> Hi everyone,
>
> I have a JavaPairDStream<String, String> object and I'd like the Driver to
> create a txt file (on HDFS) containing all of its elements.
>
> At the moment, I use the /coalesce(1, true)/ method:
>
>
> JavaPairDStream<String, String> unified = [partitioned stuff]
> unified.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
>     public Void call(JavaPairRDD<String, String> arg0) throws Exception {
>         arg0.coalesce(1, true).saveAsTextFile();
>         return null;
>     }
> });
>
>
> but this implies that a /single worker/ is taking all the data and writing
> to HDFS, and that could be a major bottleneck.
>
> How could I replace the worker with the Driver? I read that /collect()/
> might do this, but I haven't the slightest idea on how to implement it.
>
> Can anybody help me?
>
> Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


I'd like to add our company to the Powered by Spark page

2015-09-11 Thread Timothy Snyder
Hello,

I'm interested in adding our company to this Powered by Spark 
page. I've 
included some information below, but if you have any questions or need any 
additional information please let me know.


Organization name: Hawk Search
URL: hawksearch.com
Spark Components being used: Spark & MLlib
Short description of our use case: We are using Spark & MLlib for our Hawk 
Search Product Recommendations Module. Spark allows us to improve models and 
provide predictive analytics for marketing automation and personalization.


Thank you for your time,
-Tim


Tim Snyder / Marketing Coordinator - Thanx Media, 
Inc.
ROC Commerce / Hawk 
Search / Site Vibes
O: 630-469-2929

Confidentiality Note: This e-mail message and any attachments to it are 
intended only for the named recipients and may contain confidential 
information. If you are not one of the intended recipients, please do not 
duplicate or forward this e-mail message and immediately delete it from your 
computer.  By accepting and opening this email, recipient agrees to keep all 
information confidential and is not allowed to distribute to anyone outside 
their organization.



Re: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
I ran a similar benchmark for 1.5: do a self join on a fact table with a
join key that has many duplicated rows (say there are N rows for the same
join key); after the join, there will be N*N rows for each join key.
Generating the joined row is slower in 1.5 than in 1.4 (it needs to copy
the left and right rows together, which 1.4 does not). If the generated
row is accessed after the join, there will not be much difference between
1.5 and 1.4, because accessing the joined row is slower in 1.4 than in
1.5.

So, for this particular query, 1.5 is slower than 1.4, and will be even
slower if you increase N. But for real workloads it will not be; 1.5
is usually faster than 1.4.
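A toy illustration of that blow-up (just a sketch against an existing
sqlContext; the numbers are made up):

import org.apache.spark.sql.functions.col

// Each of the 3 keys appears N = 2 times, so the self join yields N*N = 4
// rows per key.
val df = sqlContext.range(0, 6).selectExpr("id % 3 as key", "id as value")
val joined = df.as("l").join(df.as("r"), col("l.key") === col("r.key"))
println(joined.count())   // 3 keys * 4 = 12 rows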

On Fri, Sep 11, 2015 at 1:31 AM, prosp4300  wrote:
>
>
> By the way, turning off code generation could be an option to try; sometimes
> code generation can introduce slowness.
>
>
> On 2015-09-11 15:58, Cheng, Hao wrote:
>
> Can you confirm whether the query really runs in cluster mode, not local
> mode? Can you print the call stack of the executor while the query is running?
>
>
>
> BTW: spark.shuffle.reduceLocality.enabled is the configuration of Spark, not 
> Spark SQL.
>
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
>
>
> I added the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as without setting these two.
>
> One thing is that on the Spark UI, when I click the SQL tab, it shows an
> empty page with only the header title 'SQL'; there is no table showing queries and
> execution plan information.
>
>
>
>
>
> At 2015-09-11 14:39:06, "Todd"  wrote:
>
>
> Thanks Hao.
>  Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>
>
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao" wrote:
>
> You mean the performance is still as slow as with SMJ in Spark 1.5?
>
>
>
> Can you set spark.shuffle.reduceLocality.enabled=false when you start
> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
> default, but we found it can cause performance to drop dramatically.
>
>
>
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 2:17 PM
> To: Cheng, Hao
> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
>
>
> Thanks Hao for the reply.
> I turned the sort merge join off; the physical plan is below, but the
> performance is roughly the same as with it on...
>
> == Physical Plan ==
> TungstenProject 
> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
>  ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>   TungstenExchange hashpartitioning(ss_item_sk#2)
>ConvertToUnsafe
> Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>   TungstenExchange hashpartitioning(ss_item_sk#25)
>ConvertToUnsafe
> Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>
> Code Generation: true
>
>
>
>
> At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:
>
> It is not a big surprise that SMJ is slower than the HashJoin, as we do not
> fully utilize the sorting yet; more details can be found at
> https://issues.apache.org/jira/browse/SPARK-2926 .
>
>
>
> Anyway, can you disable the sort merge join by
> “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query
> again? In our previous testing, sort merge join was about 20% slower. I
> am not sure if anything else is slowing down the performance.
>
>
>
> Hao
>
>
>
>
>
> From: Jesse F Chen [mailto:jfc...@us.ibm.com]
> Sent: Friday, September 11, 2015 1:18 PM
> To: Michael Armbrust
> Cc: Todd; user@spark.apache.org
> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
>
>
> Could this be a build issue (i.e., sbt package)?
>
> When I run the same jar built for 1.4.1 on 1.5, I see a large regression
> in queries too (all other things identical)...
>
> I am curious: to build 1.5 (when it isn't released yet), what do I need to do
> with the build.sbt file?
>
> Any special parameters I should be using to make sure I load the latest Hive
> dependencies?
>
>
> From: Michael Armbrust 
> To: Todd 
> Cc: "user@spark.apache.org" 
> Date: 09/10/2015 11:07 AM

Re: Help with collect() in Spark Streaming

2015-09-11 Thread Luca
Hi,
thanks for answering.

With the *coalesce() *transformation a single worker is in charge of
writing to HDFS, but I noticed that the single write operation usually
takes too much time, slowing down the whole computation (this is
particularly true when 'unified' is made of several partitions). Besides,
'coalesce' forces me to perform a further repartitioning ('true' flag), in
order not to lose upstream parallelism (by the way, did I get this part
right?).
Am I wrong in thinking that having the driver do the writing will speed
things up, without the need of repartitioning data?

Hope I have been clear, I am pretty new to Spark. :)

2015-09-11 18:19 GMT+02:00 Holden Karau :

> A common practice to do this is to use foreachRDD with a local var to
> accumulate the data (you can see it in the Spark Streaming test code).
>
> That being said, I am a little curious why you want the driver to create
> the file specifically.
>
> On Friday, September 11, 2015, allonsy  wrote:
>
>> Hi everyone,
>>
>> I have a JavaPairDStream<String, String> object and I'd like the Driver
>> to create a txt file (on HDFS) containing all of its elements.
>>
>> At the moment, I use the /coalesce(1, true)/ method:
>>
>>
>> JavaPairDStream<String, String> unified = [partitioned stuff]
>> unified.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
>>     public Void call(JavaPairRDD<String, String> arg0) throws Exception {
>>         arg0.coalesce(1, true).saveAsTextFile();
>>         return null;
>>     }
>> });
>>
>>
>> but this implies that a /single worker/ is taking all the data and writing
>> to HDFS, and that could be a major bottleneck.
>>
>> How could I replace the worker with the Driver? I read that /collect()/
>> might do this, but I haven't the slightest idea on how to implement it.
>>
>> Can anybody help me?
>>
>> Thanks in advance.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> Linked In: https://www.linkedin.com/in/holdenkarau
>
>


Re: Help with collect() in Spark Streaming

2015-09-11 Thread Holden Karau
Having the driver write the data instead of a worker probably won't speed
it up; you still need to copy all of the data to a single node. Is there
something which forces you to only write from a single node?

On Friday, September 11, 2015, Luca  wrote:

> Hi,
> thanks for answering.
>
> With the *coalesce() *transformation a single worker is in charge of
> writing to HDFS, but I noticed that the single write operation usually
> takes too much time, slowing down the whole computation (this is
> particularly true when 'unified' is made of several partitions). Besides,
> 'coalesce' forces me to perform a further repartitioning ('true' flag), in
> order not to lose upstream parallelism (by the way, did I get this part
> right?).
> Am I wrong in thinking that having the driver do the writing will speed
> things up, without the need of repartitioning data?
>
> Hope I have been clear, I am pretty new to Spark. :)
>
> 2015-09-11 18:19 GMT+02:00 Holden Karau:
>
>> A common practice to do this is to use foreachRDD with a local var to
>> accumulate the data (you can see it in the Spark Streaming test code).
>>
>> That being said, I am a little curious why you want the driver to create
>> the file specifically.
>>
>> On Friday, September 11, 2015, allonsy  wrote:
>>
>>> Hi everyone,
>>>
>>> I have a JavaPairDStream<String, String> object and I'd like the Driver
>>> to create a txt file (on HDFS) containing all of its elements.
>>>
>>> At the moment, I use the /coalesce(1, true)/ method:
>>>
>>>
>>> JavaPairDStream<String, String> unified = [partitioned stuff]
>>> unified.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
>>>     public Void call(JavaPairRDD<String, String> arg0) throws Exception {
>>>         arg0.coalesce(1, true).saveAsTextFile();
>>>         return null;
>>>     }
>>> });
>>>
>>>
>>> but this implies that a /single worker/ is taking all the data and
>>> writing
>>> to HDFS, and that could be a major bottleneck.
>>>
>>> How could I replace the worker with the Driver? I read that /collect()/
>>> might do this, but I haven't the slightest idea on how to implement it.
>>>
>>> Can anybody help me?
>>>
>>> Thanks in advance.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>> Linked In: https://www.linkedin.com/in/holdenkarau
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Spark monitoring

2015-09-11 Thread prk77
Is there a way to fetch the current spark cluster memory & cpu usage
programmatically ?
I know that the default spark master web ui has these details but I want to
retrieve them through a program and store them for analysis.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-monitoring-tp24660.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-11 Thread Davies Liu
Did this happen immediately after you start the cluster or after running
some queries?

Is this in local mode or cluster mode?

On Fri, Sep 11, 2015 at 3:00 AM, Jagat Singh  wrote:
> Hi,
>
> We have queries which were running fine on 1.4.1 system.
>
> We are testing upgrade and even simple query like
>
> val t1= sqlContext.sql("select count(*) from table")
>
> t1.show
>
> This works perfectly fine on 1.4.1 but throws an OOM error in 1.5.0.
>
> Are there any changes in the default memory settings from 1.4.1 to 1.5.0?
>
> Thanks,
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Jo Sunad
I've found Apache Zeppelin to be a good start if you want to visualize
spark data. It doesn't come with streaming visualizations, although I've
seen people tweak the code so it does let you do real time visualizations
with spark streaming

Other tools I've heard about are python notebook and spark notebook



On Fri, Sep 11, 2015 at 8:56 AM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:

> Hi
>
> I have got streaming data which needs to be processed and sent for
> visualization.  I am planning to use Spark Streaming for this but am a little
> bit confused in choosing a visualization tool. I read somewhere that D3.js
> can be used, but I wanted to know which is the best tool for visualization
> when dealing with a streaming application (something that can be easily integrated).
>
> If someone has any link which can tell about D3.js(or any other
> visualization tool) and Spark streaming application integration  then
> please share . That would be great help.
>
>
> Thanks and Regards
> Shashi
>
>


Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Silvio Fiorito
So if you want to build your own from the ground up, then yes you could go the 
d3js route. Like Feynman also responded you could use something like Spark 
Notebook or Zeppelin to create some charts as well. It really depends on your 
intended audience and ultimate goal. If you just want some counters and graphs 
without any interactivity it shouldn't be too difficult.

Another option, if you’re willing to use a hosted service, would be something 
like MS Power BI. I’ve used this to publish data and have realtime dashboards 
and reports fed by Spark.

From: Shashi Vishwakarma
Date: Friday, September 11, 2015 at 11:56 AM
To: "user@spark.apache.org"
Subject: Realtime Data Visualization Tool for Spark

Hi

I have got streaming data which needs to be processed and sent for
visualization.  I am planning to use Spark Streaming for this but am a little bit
confused in choosing a visualization tool. I read somewhere that D3.js can be
used, but I wanted to know which is the best tool for visualization when dealing
with a streaming application (something that can be easily integrated).

If someone has any link which can tell about D3.js(or any other visualization 
tool) and Spark streaming application integration  then please share . That 
would be great help.


Thanks and Regards
Shashi



Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Dean Wampler
Here's a demonstration video from @noootsab himself (creator of Spark
Notebook) showing live charting in Spark Notebook. It's one reason I prefer
it over the other options.

https://twitter.com/noootsab/status/638489244160401408

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Fri, Sep 11, 2015 at 12:55 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> So if you want to build your own from the ground up, then yes you could go
> the d3js route. Like Feynman also responded you could use something like
> Spark Notebook or Zeppelin to create some charts as well. It really depends
> on your intended audience and ultimate goal. If you just want some counters
> and graphs without any interactivity it shouldn't be too difficult.
>
> Another option, if you’re willing to use a hosted service, would be
> something like MS Power BI. I’ve used this to publish data and have
> realtime dashboards and reports fed by Spark.
>
> From: Shashi Vishwakarma
> Date: Friday, September 11, 2015 at 11:56 AM
> To: "user@spark.apache.org"
> Subject: Realtime Data Visualization Tool for Spark
>
> Hi
>
> I have got streaming data which needs to be processed and sent for
> visualization.  I am planning to use Spark Streaming for this but am a little
> bit confused in choosing a visualization tool. I read somewhere that D3.js
> can be used, but I wanted to know which is the best tool for visualization
> when dealing with a streaming application (something that can be easily integrated).
>
> If someone has any link which can tell about D3.js(or any other
> visualization tool) and Spark streaming application integration  then
> please share . That would be great help.
>
>
> Thanks and Regards
> Shashi
>
>


Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Folks,

Any help on this?

Regards,
Atul.


On Fri, Sep 11, 2015 at 8:39 AM, Atul Kulkarni 
wrote:

> Hi Raghavendra,
>
> Thanks for your answers, I am passing 10 executors and I am not sure if
> that is the problem. It is still hung.
>
> Regards,
> Atul.
>
>
> On Fri, Sep 11, 2015 at 12:40 AM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> You can pass the number of executors via the command line option
>> --num-executors. You need more than 2 executors to make Spark Streaming
>> work.
>>
>> For more details on command line option, please go through
>> http://spark.apache.org/docs/latest/running-on-yarn.html.
>>
>>
>> On Fri, Sep 11, 2015 at 10:52 AM, Atul Kulkarni 
>> wrote:
>>
>>> I am submitting the job with yarn-cluster mode.
>>>
>>> spark-submit --master yarn-cluster ...
>>>
>>> On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey <
>>> raghavendra.pan...@gmail.com> wrote:
>>>
 What is the value of the spark master conf? By default it is local, which
 means only one thread can run, and that is why your job is stuck.
 Specify local[*] to make the thread pool equal to the number of cores...

 Raghav
 On Sep 11, 2015 6:06 AM, "Atul Kulkarni" 
 wrote:

> Hi Folks,
>
> Below is the code I have for a Spark-based Kafka producer to take
> advantage of multiple executors reading files in parallel on my cluster,
> but
> I am stuck: the program is not making any progress.
>
> Below is my scrubbed code:
>
> val sparkConf = new SparkConf().setAppName(applicationName)
> val ssc = new StreamingContext(sparkConf, Seconds(2))
>
> val producerObj = ssc.sparkContext.broadcast(KafkaSink(kafkaProperties))
>
> val zipFileDStreams = ssc.textFileStream(inputFiles)
> zipFileDStreams.foreachRDD {
>   rdd =>
> rdd.foreachPartition(
>   partition => {
> partition.foreach{
>   case (logLineText) =>
> println(logLineText)
> producerObj.value.send(topics, logLineText)
> }
>   }
> )
> }
>
> ssc.start()
> ssc.awaitTermination()
>
> ssc.stop()
>
> The code for KafkaSink is as follows.
>
> class KafkaSink(createProducer: () => KafkaProducer[Array[Byte], 
> Array[Byte]]) extends Serializable {
>
>   lazy val producer = createProducer()
>   val logParser = new LogParser()
>
>   def send(topic: String, value: String): Unit = {
>
> val logLineBytes = 
> Bytes.toBytes(logParser.avroEvent(value.split("\t")).toString)
> producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, 
> logLineBytes))
>   }
> }
>
> object KafkaSink {
>   def apply(config: Properties): KafkaSink = {
>
> val f = () => {
>   val producer = new KafkaProducer[Array[Byte], Array[Byte]](config, 
> null, null)
>
>   sys.addShutdownHook {
> producer.close()
>   }
>   producer
> }
>
> new KafkaSink(f)
>   }
> }
>
> Disclaimer: it is based on the code inspired by
> http://allegro.tech/spark-kafka-integration.html.
>
> The job just sits there; I cannot see any job stages being created.
> Something I want to mention - I am trying to read gzipped files from
> HDFS
> - could it be that the streaming context is not able to read *.gz files?
>
>
> I am not sure what more details I can provide to help explain my
> problem.
>
>
> --
> Regards,
> Atul Kulkarni
>

>>>
>>>
>>> --
>>> Regards,
>>> Atul Kulkarni
>>>
>>
>>
>
>
> --
> Regards,
> Atul Kulkarni
>



-- 
Regards,
Atul Kulkarni


Help with collect() in Spark Streaming

2015-09-11 Thread allonsy
Hi everyone,

I have a JavaPairDStream<String, String> object and I'd like the Driver to
create a txt file (on HDFS) containing all of its elements.

At the moment, I use the /coalesce(1, true)/ method:


JavaPairDStream<String, String> unified = [partitioned stuff]
unified.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    public Void call(JavaPairRDD<String, String> arg0) throws Exception {
        arg0.coalesce(1, true).saveAsTextFile();
        return null;
    }
});


but this implies that a /single worker/ is taking all the data and writing
to HDFS, and that could be a major bottleneck.

How could I replace the worker with the Driver? I read that /collect()/
might do this, but I haven't the slightest idea on how to implement it.

Can anybody help me? 

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-collect-in-Spark-Streaming-tp24659.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cassandra row count grouped by multiple columns

2015-09-11 Thread Eric Walker
Hi Chirag,

Maybe something like this?

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Seq(
  Row("A1", "B1", "C1"),
  Row("A2", "B2", "C2"),
  Row("A3", "B3", "C2"),
  Row("A1", "B1", "C1")
))

val schema = StructType(Seq("a", "b", "c").map(c => StructField(c, StringType)))
val df = sqlContext.createDataFrame(rdd, schema)

df.registerTempTable("rows")
sqlContext.sql("select a, b, c, count(0) as count from rows group by
a, b, c").collect()
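For reference, the same aggregation can also be written with the DataFrame API
(using the df built above):

// Count rows per distinct (a, b, c) combination.
df.groupBy("a", "b", "c").count().collect()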


Eric


On Thu, Sep 10, 2015 at 2:19 AM, Chirag Dewan 
wrote:

> Hi,
>
>
>
> I am using Spark 1.2.0 with Cassandra 2.0.14. I have a problem where I
> need a count of rows unique to multiple columns.
>
>
>
> So I have a column family with 3 columns i.e. a,b,c and for each value of
> distinct a1,b1,c1 I want the row count.
>
>
>
> For eg:
>
> A1,B1,C1
>
> A2,B2,C2
>
> A3,B3,C2
>
> A1,B1,C1
>
>
>
> The output should be:
>
> A1,B1,C1,2
>
> A2,B2,C2,1
>
> A3,B3,C3,1
>
>
>
> What is the optimum way of achieving this?
>
>
>
> Thanks in advance.
>
>
>
> Chirag
>


Re: Model summary for linear and logistic regression.

2015-09-11 Thread Feynman Liang
Sorry! The documentation is not the greatest thing in the world, but these
features are documented here
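A minimal sketch of the 1.5.0 API (assuming an existing sqlContext; the toy
training data is made up just to show the calls):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors

val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (1.0, Vectors.dense(0.1, 1.2)),
  (0.0, Vectors.dense(2.2, 0.9))
)).toDF("label", "features")

val model = new LogisticRegression().fit(training)
val summary = model.summary                       // training summary added in 1.5
println(summary.objectiveHistory.mkString(", "))  // loss per iteration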


On Fri, Sep 11, 2015 at 6:25 AM, Sebastian Kuepers <
sebastian.kuep...@publicispixelpark.de> wrote:

> Hey,
>
>
> the 1.5.0 release notes say that there are now model summaries for
> logistic regression available.
>
> But I can't find them in the current documentation.
>
> ​
>
> Any help very much appreciated!
>
> Thanks
>
>
> Sebastian
>
>
>
>
> 
> Disclaimer The information in this email and any attachments may contain
> proprietary and confidential information that is intended for the
> addressee(s) only. If you are not the intended recipient, you are hereby
> notified that any disclosure, copying, distribution, retention or use of
> the contents of this information is prohibited. When addressed to our
> clients or vendors, any information contained in this e-mail or any
> attachments is subject to the terms and conditions in any governing
> contract. If you have received this e-mail in error, please immediately
> contact the sender and delete the e-mail.
>


Re: Few Conceptual Questions on Spark-SQL and HiveQL

2015-09-11 Thread Narayanan K
Hi there,
any replies? :)

-Narayanan

On Fri, Sep 11, 2015 at 1:51 AM, Narayanan K  wrote:
> Hi all,
>
> We are migrating from Hive to Spark. We used Spark-SQL CLI to run our
> Hive Queries for performance testing. I am new to Spark and had few
> clarifications. We have :
>
>
> 1. Set up 10 boxes, one master and 9 slaves in standalone mode. Each
> of the boxes are launchers to our external Hadoop grid.
> 2. Copied hive-site.xml to spark conf. The hive metastore uri is
> external to our spark cluster.
> 3. Use spark-sql CLI to submit direct hive queries from the master
> host. Our Hive Queries hit hive tables on the remote hdfs cluster
> which are in ORC format.
>
>
> Questions :
>
> 1. What are the sequence of steps involved from the time a HQL is
> submitted to execution of query in spark cluster ?
> 2. Was an RDD created to read OrcFile from remote hdfs ? Did it get
> the storage information from the hive metastore ?
> 3. Since hdfs cluster is remote from spark cluster, how is data
> locality obtained here ?
> 4. Does running queries in Spark-SQL CLI and access remote hive
> metastore incur any cost in query performance ?
> 5. In Spark SQL programming guide , it is mentioned , Spark-SQL CLI is
> only for local mode. What does this mean ? We were able to submit 100s
> of queries using the CLI. Is there any downside to this approach ?
> 6. Is it possible to create one hivecontext, add all udf jar once and
> submit 100 queries with the same hive context ?
>
> Thanks
> Narayanan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkR connection string to Cassandra

2015-09-11 Thread Austin Trombley
Spark,

Do you have a SparkR connection string example of an RJDBC connection to a 
cassandra cluster?  Thanks
--
regards,
Austin Trombley, MBA
Senior Manager – Business Intelligence
cell: 415-767-6179





CONFIDENTIALITY STATEMENT: This email message, together with all attachments, 
is intended only for the individual or entity to which it is addressed and may 
contain legally privileged or confidential information. Any dissemination, 
distribution or copying of this communication by persons or entities other than 
the intended recipient, is strictly prohibited, and may be unlawful. If you 
have received this communication in error please contact the sender immediately 
and delete the transmitted material and all copies from your system, or if 
received in hard copy format, return the material to us via the United States 
Postal Service. Thank you.


sparksql query hive data error

2015-09-11 Thread stark_summer
I started the Hive metastore service OK.
The Hadoop IO compression codec is LZO, configured in core-site.xml:


<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.LzmaCodec</value>
</property>

<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>


Jars in spark_home/lib/:
elephant-bird-lzma-1.0.jar, hadoop-lzo-0.4.15-cdh5.1.0.jar, mysql-connector-java-5.1.35.jar

spark-env.sh:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/cluster/apps/hadoop/lib/native

export
SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/cluster/apps/hadoop/share/hadoop/yarn/:/home/cluster/apps/hadoop/share/hadoop/yarn/lib/:/home/cluster/apps/hadoop/share/hadoop/common/:/home/cluster/apps/hadoop/share/hadoop/common/lib/:/home/cluster/apps/hadoop/share/hadoop/hdfs/:/home/cluster/apps/hadoop/share/hadoop/hdfs/lib/:/home/cluster/apps/hadoop/share/hadoop/mapreduce/:/home/cluster/apps/hadoop/share/hadoop/mapreduce/lib/:/home/cluster/apps/hadoop/share/hadoop/tools/lib/:/home/cluster/apps/spark/spark-1.4.1/lib/
#export
SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/cluster/apps/hadoop/share/hadoop/yarn/:/home/cluster/apps/hadoop/share/hadoop/yarn/lib/:/home/cluster/apps/hadoop/share/hadoop/common/:/home/cluster/apps/hadoop/share/hadoop/common/lib/:/home/cluster/apps/hadoop/share/hadoop/hdfs/:/home/cluster/apps/hadoop/share/hadoop/hdfs/lib/:/home/cluster/apps/hadoop/share/hadoop/mapreduce/:/home/cluster/apps/hadoop/share/hadoop/mapreduce/lib/:/home/cluster/apps/hadoop/share/hadoop/tools/lib/

export
SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/cluster/apps/hadoop/lib/native
export
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/cluster/apps/hadoop/share/hadoop/common/hadoop-lzo-0.4.15-cdh5.1.0.jar

So I think everything is OK,
but when I start spark-sql or spark-shell and execute the command:
select * from pokes;
or
sqlContext.sql("FROM pokes SELECT foo, bar").collect().foreach(println)

I get this error:
java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:105)
at
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:503)
at
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
at

Exception in Spark-sql insertIntoJDBC command

2015-09-11 Thread Baljeet Singh
Hi,

I’m using Spark SQL to insert data from a CSV into a table in SQL Server.
The createJDBCTable command is working fine. But when
I try to insert more records into the same table that I created in the
database, using insertIntoJDBC, it throws an error message –


Exception in thread “main” com.microsoft.sqlserver.jdbc.SQLServerException:
There is already an object named ‘TestTable1’ in the database.
at
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515)
at
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404)
at
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180)
at
com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155)
at
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeUpdate(SQLServerPreparedStatement.java:314)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:252)
at org.apache.spark.sql.DataFrame.insertIntoJDBC(DataFrame.scala:1495)
at
com.dataframe.load.db.DataFrameToObjectDBCommand.saveInDB(DataFrameToObjectDBCommand.java:156)
at
com.dataframe.load.db.DataFrameToObjectDBCommand.main(DataFrameToObjectDBCommand.java:162)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/09/11 12:47:23 INFO spark.SparkContext: Invoking stop() from shutdown
hook
15/09/11 12:47:23 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0
on 10.184.57.103:54266 in memory (size: 19.5 KB, free: 265.4 MB)
15/09/11 12:47:23 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0
on 10.184.57.103:54266 in memory (size: 1821.0 B, free: 265.4 MB).



I have tried insertIntoJDBC passing the last argument as both true
and false, but it throws the same exception.
Another thing is that I’m using Spark SQL version 1.4.1, and both methods,
createJDBCTable and insertIntoJDBC, are deprecated. Can you please help with
this, or is there any other way to do the same in version 1.4.1?
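For what it's worth, the non-deprecated path in 1.4.x goes through
DataFrameWriter; a rough sketch of the append case (URL and credentials are
placeholders, and if the driver's table-existence check does not work against
your database this can fail in the same way):

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "<user>")
props.setProperty("password", "<password>")

df.write
  .mode(SaveMode.Append)   // append rows instead of trying to create the table again
  .jdbc("jdbc:sqlserver://<host>:1433;databaseName=<db>", "TestTable1", props)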



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-in-Spark-sql-insertIntoJDBC-command-tp24655.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
Can you confirm whether the query really runs in cluster mode, not local
mode? Can you print the call stack of the executor while the query is running?

BTW: spark.shuffle.reduceLocality.enabled is the configuration of Spark, not 
Spark SQL.

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 3:39 PM
To: Todd
Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
compared with spark 1.4.1 SQL

I added the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as without setting these two.

One thing is that on the Spark UI, when I click the SQL tab, it shows an empty
page with only the header title 'SQL'; there is no table showing queries and execution
plan information.




At 2015-09-11 14:39:06, "Todd" wrote:


Thanks Hao.
 Yes, it is still as slow as with SMJ. Let me try the option you suggested.


At 2015-09-11 14:34:46, "Cheng, Hao" wrote:

You mean the performance is still as slow as with SMJ in Spark 1.5?

Can you set spark.shuffle.reduceLocality.enabled=false when you start
spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
default, but we found it can cause performance to drop dramatically.


From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; 
user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

Thanks Hao for the reply.
I turned the sort merge join off; the physical plan is below, but the performance
is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true



At 2015-09-11 13:48:23, "Cheng, Hao" wrote:
It is not a big surprise that SMJ is slower than the HashJoin, as we do not
fully utilize the sorting yet; more details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again?
In our previous testing, sort merge join was about 20% slower. I am not
sure if anything else is slowing down the performance.

Hao


From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL


Could this be a build issue (i.e., sbt package)?

When I run the same jar built for 1.4.1 on 1.5, I see a large regression in
queries too (all other things identical)...

I am curious: to build 1.5 (when it isn't released yet), what do I need to do
with the build.sbt file?

Any special parameters I should be using to make sure I load the latest Hive
dependencies?


From: Michael Armbrust
To: Todd
Cc: "user@spark.apache.org"
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL





I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4 with only small exceptions.  A few thoughts,

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.

On Thu, Sep 10, 2015 at 1:24 AM, Todd wrote:
Hi,

I am using data generated with
spark-sql-perf (https://github.com/databricks/spark-sql-perf)

Re: MongoDB and Spark

2015-09-11 Thread Sandeep Giri
use map-reduce.
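In case "map-reduce" here means the mongo-hadoop connector's input format, a
rough Scala sketch (class and config names are from mongo-hadoop; the URIs are
placeholders). The connector takes one input URI per collection, so the usual
workaround for multiple collections is one Configuration per collection and a
union of the resulting RDDs:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.BSONObject

val sc = new SparkContext(new SparkConf().setAppName("mongo-multi-collection"))

def collectionRDD(uri: String) = {
  val conf = new Configuration()
  conf.set("mongo.input.uri", uri)   // e.g. mongodb://host:27017/db.collectionA
  sc.newAPIHadoopRDD(conf,
    classOf[com.mongodb.hadoop.MongoInputFormat],
    classOf[Object], classOf[BSONObject])
}

val combined = collectionRDD("mongodb://host:27017/db.collectionA")
  .union(collectionRDD("mongodb://host:27017/db.collectionB"))
println(combined.count())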

On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
wrote:

> Hello ,
>
>
>
> Is there any way to query multiple collections from MongoDB using Spark
> and Java? And I want to create only one Configuration object. Please help
> if anyone has something regarding this.
>
>
>
>
>
> Thank You
>
> Abhishek
>


Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-11 Thread Jagat Singh
Hi,

We have queries which were running fine on 1.4.1 system.

We are testing upgrade and even simple query like

val t1= sqlContext.sql("select count(*) from table")

t1.show

This works perfectly fine on 1.4.1 but throws an OOM error in 1.5.0.

Are there any changes in the default memory settings from 1.4.1 to 1.5.0?

Thanks,


Multilabel classification support

2015-09-11 Thread Yasemin Kaya
Hi,

I want to use MLlib for multilabel classification, but I found
http://spark.apache.org/docs/latest/mllib-classification-regression.html,
and it is not what I mean. Is there a way to do multilabel classification?
Thanks a lot.

Best,
yasemin

-- 
hiç ender hiç


MongoDB and Spark

2015-09-11 Thread Mishra, Abhishek
Hello ,

Is there any way to query multiple collections from MongoDB using Spark and
Java? And I want to create only one Configuration object. Please help if
anyone has something regarding this.


Thank You
Abhishek


RE: MongoDB and Spark

2015-09-11 Thread Mishra, Abhishek
Anything using Spark RDDs?

Abhishek

From: Sandeep Giri [mailto:sand...@knowbigdata.com]
Sent: Friday, September 11, 2015 3:19 PM
To: Mishra, Abhishek; user@spark.apache.org; d...@spark.apache.org
Subject: Re: MongoDB and Spark


use map-reduce.

On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
> wrote:
Hello ,

Is there any way to query multiple collections from MongoDB using Spark and
Java? And I want to create only one Configuration object. Please help if
anyone has something regarding this.


Thank You
Abhishek


Re: Spark based Kafka Producer

2015-09-11 Thread Raghavendra Pandey
You can pass the number of executors via the command line option
--num-executors. You need more than 2 executors to make Spark Streaming
work.

For more details on command line option, please go through
http://spark.apache.org/docs/latest/running-on-yarn.html.
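
For illustration, a minimal Scala sketch of the equivalent setting in code,
assuming spark.executor.instances is the configuration key behind
--num-executors on YARN (the value 4 and the application name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("kafka-producer-job")       // placeholder name
  .set("spark.executor.instances", "4")   // same effect as --num-executors 4 on YARN
val ssc = new StreamingContext(sparkConf, Seconds(2))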


On Fri, Sep 11, 2015 at 10:52 AM, Atul Kulkarni 
wrote:

> I am submitting the job with yarn-cluster mode.
>
> spark-submit --master yarn-cluster ...
>
> On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> What is the value of spark master conf.. By default it is local, that
>> means only one thread can run and that is why your job is stuck.
>> Specify it local[*], to make thread pool equal to number of cores...
>>
>> Raghav
>> On Sep 11, 2015 6:06 AM, "Atul Kulkarni"  wrote:
>>
>>> Hi Folks,
>>>
>>> Below is the code I have for a Spark-based Kafka producer that takes advantage
>>> of multiple executors reading files in parallel on my cluster, but I am
>>> stuck: the program is not making any progress.
>>>
>>> Below is my scrubbed code:
>>>
>>> val sparkConf = new SparkConf().setAppName(applicationName)
>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>>
>>> val producerObj = ssc.sparkContext.broadcast(KafkaSink(kafkaProperties))
>>>
>>> val zipFileDStreams = ssc.textFileStream(inputFiles)
>>> zipFileDStreams.foreachRDD {
>>>   rdd =>
>>> rdd.foreachPartition(
>>>   partition => {
>>> partition.foreach{
>>>   case (logLineText) =>
>>> println(logLineText)
>>> producerObj.value.send(topics, logLineText)
>>> }
>>>   }
>>> )
>>> }
>>>
>>> ssc.start()
>>> ssc.awaitTermination()
>>>
>>> ssc.stop()
>>>
>>> The code for KafkaSink is as follows.
>>>
>>> class KafkaSink(createProducer: () => KafkaProducer[Array[Byte], 
>>> Array[Byte]]) extends Serializable {
>>>
>>>   lazy val producer = createProducer()
>>>   val logParser = new LogParser()
>>>
>>>   def send(topic: String, value: String): Unit = {
>>>
>>> val logLineBytes = 
>>> Bytes.toBytes(logParser.avroEvent(value.split("\t")).toString)
>>> producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, 
>>> logLineBytes))
>>>   }
>>> }
>>>
>>> object KafkaSink {
>>>   def apply(config: Properties): KafkaSink = {
>>>
>>> val f = () => {
>>>   val producer = new KafkaProducer[Array[Byte], Array[Byte]](config, 
>>> null, null)
>>>
>>>   sys.addShutdownHook {
>>> producer.close()
>>>   }
>>>   producer
>>> }
>>>
>>> new KafkaSink(f)
>>>   }
>>> }
>>>
>>> Disclaimer: it is based on the code inspired by
>>> http://allegro.tech/spark-kafka-integration.html.
>>>
>>> The job just sits there; I cannot see any job stages being created.
>>> Something I want to mention - I am trying to read gzipped files from HDFS
>>> - could it be that the streaming context is not able to read *.gz files?
>>>
>>>
>>> I am not sure what more details I can provide to help explain my problem.
>>>
>>>
>>> --
>>> Regards,
>>> Atul Kulkarni
>>>
>>
>
>
> --
> Regards,
> Atul Kulkarni
>


Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Todd
I added the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as without setting these two.

One more thing: on the Spark UI, when I click the SQL tab, it shows an empty
page with only the header title 'SQL'; there is no table showing queries and
execution plan information.








At 2015-09-11 14:39:06, "Todd"  wrote:


Thanks Hao.
 Yes,it is still low as SMJ。Let me try the option your suggested,




At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:


You mean the performance is still slow as the SMJ in Spark 1.5?

 

Can you set the spark.shuffle.reduceLocality.enabled=false when you start the 
spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by 
default, but we found it probably causes the performance reduce dramatically.

 

 

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

 

Thanks Hao for the reply.
I turn the merge sort join off, the physical plan is below, but the performance 
is roughly the same as it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true







At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:



This is not a big surprise the SMJ is slower than the HashJoin, as we do not 
fully utilize the sorting yet, more details can be found at 
https://issues.apache.org/jira/browse/SPARK-2926 .

 

Anyway, can you disable the sort merge join by 
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again? 
In our previous testing, it’s about 20% slower for sort merge join. I am not 
sure if there anything else slow down the performance.

 

Hao

 

 

From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL

 

Could this be a build issue (i.e., sbt package)?

If I ran the same jar build for 1.4.1 in 1.5, I am seeing large regression too 
in queries (all other things identical)...

I am curious, to build 1.5 (when it isn't released yet), what do I need to do 
with the build.sbt file?

any special parameters i should be using to make sure I load the latest hive 
dependencies?

Michael Armbrust ---09/10/2015 11:07:28 AM---I've been running TPC-DS SF=1500 
daily on Spark 1.4.1 and Spark 1.5 on S3, so this is surprising.  I

From: Michael Armbrust 
To: Todd 
Cc: "user@spark.apache.org" 
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL




I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4 with only small exceptions.  A few thoughts,

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.

On Thu, Sep 10, 2015 at 1:24 AM, Todd  wrote:

Hi,

I am using data generated with 
sparksqlperf(https://github.com/databricks/spark-sql-perf) to test the spark 
sql performance (spark on yarn, with 10 nodes) with the following code (The 
table store_sales is about 90 million records, 6G in size)
 
val outputDir="hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name="store_sales"
sqlContext.sql(
  s"""
  |CREATE TEMPORARY TABLE ${name}
  |USING org.apache.spark.sql.parquet
  |OPTIONS (
  |  path '${outputDir}'
  |)
""".stripMargin)

val sql="""
 |select
 |  t1.ss_quantity,
 |  t1.ss_list_price,
 |  t1.ss_coupon_amt,
 |  t1.ss_cdemo_sk,
 |  t1.ss_item_sk,
 |  t1.ss_promo_sk,
 |  t1.ss_sold_date_sk
 |from store_sales t1 join store_sales t2 on t1.ss_item_sk = 
t2.ss_item_sk
 |where
 |  t1.ss_sold_date_sk between 2450815 and 2451179
   """.stripMargin

val df = sqlContext.sql(sql)
df.rdd.foreach(row=>Unit)

With 

Re: Multilabel classification support

2015-09-11 Thread Alexis Gillain
You can try these packages for adaboost.mh :

https://github.com/BaiGang/spark_multiboost (scala)
or
https://github.com/tizfa/sparkboost (java)


2015-09-11 15:29 GMT+08:00 Yasemin Kaya :

> Hi,
>
> I want to use Mllib for multilabel classification, but I find
> http://spark.apache.org/docs/latest/mllib-classification-regression.html,
> it is not what I mean. Is there a way to use  multilabel classification?
> Thanks alot.
>
> Best,
> yasemin
>
> --
> hiç ender hiç
>



-- 
Alexis GILLAIN


Few Conceptual Questions on Spark-SQL and HiveQL

2015-09-11 Thread Narayanan K
Hi all,

We are migrating from Hive to Spark. We used the Spark SQL CLI to run our
Hive queries for performance testing. I am new to Spark and have a few
questions. Our setup:


1. Set up 10 boxes, one master and 9 slaves, in standalone mode. Each
of the boxes is a launcher to our external Hadoop grid.
2. Copied hive-site.xml to the Spark conf directory. The Hive metastore URI is
external to our Spark cluster.
3. Use the spark-sql CLI to submit Hive queries directly from the master
host. Our Hive queries hit Hive tables on the remote HDFS cluster,
which are stored in ORC format.


Questions:

1. What is the sequence of steps involved from the time an HQL query is
submitted to its execution in the Spark cluster?
2. Was an RDD created to read the ORC files from remote HDFS? Did it get
the storage information from the Hive metastore?
3. Since the HDFS cluster is remote from the Spark cluster, how is data
locality obtained here?
4. Does running queries in the Spark-SQL CLI and accessing a remote Hive
metastore incur any cost in query performance?
5. In the Spark SQL programming guide it is mentioned that the Spark-SQL CLI is
only for local mode. What does this mean? We were able to submit 100s
of queries using the CLI. Is there any downside to this approach?
6. Is it possible to create one HiveContext, add all UDF jars once, and
submit 100 queries with the same HiveContext?

Thanks
Narayanan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-11 Thread Tim Chen
Yes you can create an issue, or actually contribute a patch to update it :)

Sorry the docs are a bit light; I'm going to make them more complete along the
way.

Tim


On Fri, Sep 11, 2015 at 11:11 AM, Tom Waterhouse (tomwater) <
tomwa...@cisco.com> wrote:

> Tim,
>
> Thank you for the explanation.  You are correct, my Mesos experience is
> very light, and I haven’t deployed anything via Marathon yet.  What you
> have stated here makes sense, I will look into doing this.
>
> Adding this info to the docs would be great.  Is the appropriate action to
> create an issue regarding improvement of the docs?  For those of us who are
> gaining the experience having such a pointer is very helpful.
>
> Tom
>
> From: Tim Chen 
> Date: Thursday, September 10, 2015 at 10:25 AM
> To: Tom Waterhouse 
> Cc: "user@spark.apache.org" 
> Subject: Re: Spark on Mesos with Jobs in Cluster Mode Documentation
>
> Hi Tom,
>
> Sorry the documentation isn't really rich, since it's probably assuming
> users understands how Mesos and framework works.
>
> First I need explain the rationale of why create the dispatcher. If you're
> not familiar with Mesos yet, each node in your datacenter is installed a
> Mesos slave where it's responsible for publishing resources and
> running/watching tasks, and Mesos master is responsible for taking the
> aggregated resources and scheduling them among frameworks.
>
> Frameworks are not managed by Mesos, as Mesos master/slave doesn't launch
> and maintain framework but assume they're launched and kept running on its
> own. All the existing frameworks in the ecosystem therefore all have their
> own ways to deploy, HA and persist state (e.g: Aurora, Marathon, etc).
>
> Therefore, to introduce cluster mode with Mesos, we must create a
> framework that is long running that can be running in your datacenter, and
> can handle launching spark drivers on demand and handle HA, etc. This is
> what the dispatcher is all about.
>
> So the idea is that you should launch the dispatcher not on the client,
> but on a machine in your datacenter. In Mesosphere's DCOS we launch all
> frameworks and long running services with Marathon, and you can use
> Marathon to launch the Spark dispatcher.
>
> Then all clients instead of specifying the Mesos master URL (e.g:
> mesos://mesos.master:2181), then just talks to the dispatcher only
> (mesos://spark-dispatcher.mesos:7077), and the dispatcher will then start
> and watch the driver for you.
>
> Tim
>
>
>
> On Thu, Sep 10, 2015 at 10:13 AM, Tom Waterhouse (tomwater) <
> tomwa...@cisco.com> wrote:
>
>> After spending most of yesterday scouring the Internet for sources of
>> documentation for submitting Spark jobs in cluster mode to a Spark cluster
>> managed by Mesos I was able to do just that, but I am not convinced that
>> how I have things setup is correct.
>>
>> I used the Mesos published
>> 
>> instructions for setting up my Mesos cluster.  I have three Zookeeper
>> instances, three Mesos master instances, and three Mesos slave instances.
>> This is all running in Openstack.
>>
>> The documentation on the Spark documentation site states that “To use
>> cluster mode, you must start the MesosClusterDispatcher in your cluster via
>> the sbin/start-mesos-dispatcher.sh script, passing in the Mesos master
>> url (e.g: mesos://host:5050).”  That is it, no more information than
>> that.  So that is what I did: I have one machine that I use as the Spark
>> client for submitting jobs.  I started the Mesos dispatcher with script as
>> described, and using the client machine’s IP address and port as the target
>> for the job submitted the job.
>>
>> The job is currently running in Mesos as expected.  This is not however
>> how I would have expected to configure the system.  As running there is one
>> instance of the Spark Mesos dispatcher running outside of Mesos, so not a
>> part of the sphere of Mesos resource management.
>>
>> I used the following Stack Overflow posts as guidelines:
>> http://stackoverflow.com/questions/31164725/spark-mesos-dispatcher
>> http://stackoverflow.com/questions/31294515/start-spark-via-mesos
>>
>> There must be better documentation on how to deploy Spark in Mesos with
>> jobs able to be deployed in cluster mode.
>>
>> I can follow up with more specific information regarding my deployment
>> if necessary.
>>
>> Tom
>>
>
>


Re: selecting columns with the same name in a join

2015-09-11 Thread Michael Armbrust
Here is what I get on branch-1.5:

x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF()
y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF()
x.registerTempTable('x')
y.registerTempTable('y')
sqlContext.sql("select y.v, x.v FROM x INNER JOIN y ON x.k=y.k").collect()

Out[1]: [Row(v=u'Ruud', v=u'Evert')]

On Fri, Sep 11, 2015 at 3:14 AM, Evert Lammerts 
wrote:

> Am I overlooking something? This doesn't seem right:
>
> x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF()
> y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF()
> x.registerTempTable('x')
> y.registerTempTable('y')
> sqlContext.sql("select y.v, x.v FROM x INNER JOIN y ON x.k=y.k").collect()
>
> Out[26]: [Row(v=u'Evert', v=u'Evert')]
>
> May just be because I'm behind; I'm on:
>
> Spark 1.5.0-SNAPSHOT (git revision 27ef854) built for Hadoop 2.6.0 Build
> flags: -Pyarn -Psparkr -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -DskipTests
>
> Can somebody check whether the above code does work on the latest release?
>
> Thanks!
> Evert
>


Re: New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Cody Koeninger
No, in general you can't make new RDDs in code running on the executors.

It looks like your properties file is a constant, why not process it at the
beginning of the job and broadcast the result?
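
A minimal Scala sketch of the broadcast approach suggested above (the path,
the parsing, and the lookup are placeholders; the original code in this thread
uses the Java API):

import org.apache.spark.SparkContext

def buildFeatureBroadcast(sc: SparkContext) = {
  // Parse the properties file once on the driver...
  val featureMap: Map[String, String] =
    sc.textFile("/path/to/featurelist.properties")  // placeholder path
      .map(_.split("=", 2))
      .filter(_.length == 2)
      .map(parts => parts(0) -> parts(1))
      .collect()
      .toMap
  // ...and broadcast it so executors can look values up without
  // creating new RDDs inside the streaming operations.
  sc.broadcast(featureMap)
}

// Inside the DStream processing, use the broadcast's value.get(key)
// instead of calling jsc.textFile(...) on the executors.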

On Fri, Sep 11, 2015 at 2:09 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> Hello all,
>
>
>
> Can we invoke JavaRDD while processing stream from Kafka for example.
> Following code is throwing some serialization exception.  Not sure if this
> is feasible.
>
>
>
>   JavaStreamingContext jssc = *new* JavaStreamingContext(jsc, Durations.
> *seconds*(5));
>
> JavaPairReceiverInputDStream messages = KafkaUtils.
> *createStream*(jssc, zkQuorum, group, topicMap);
>
> JavaDStream lines = messages.map(*new* *Function String>, String>()* {
>
>   *public* String call(Tuple2 tuple2) { *return*
> tuple2._2();
>
>   }
>
> });
>
> JavaPairDStream wordCounts = lines.mapToPair( *new* 
> *PairFunction String, String>()* {
>
> *public* Tuple2 call(String urlString) {
>
> String propertiesFile =
> "/home/cloudera/Desktop/sample/input/featurelist.properties";
>
> JavaRDD propertiesFileRDD = jsc.textFile(
> propertiesFile);
>
>   JavaPairRDD featureKeyClassPair
> = propertiesFileRDD.mapToPair(
>
>   *new* *PairFunction String>()* {
>
>   *public* Tuple2 String> call(String property) {
>
> *return* *new**
> Tuple2(**property**.split(**"="**)[0], **property**.split(**"="**)[1])*;
>
>   }
>
>  });
>
> featureKeyClassPair.count();
>
>   *return* *new* Tuple2(urlString,  featureScore);
>
> }
>
>   });
>
>
>


Error - Calling a package (com.databricks:spark-csv_2.10:1.0.3) with spark-submit

2015-09-11 Thread Subhajit Purkayastha
 

I am on spark 1.3.1

 

When I do the following with spark-shell, it works

 

spark-shell --packages com.databricks:spark-csv_2.10:1.0.3

 

Then I can create a DF using the spark-csv package

import sqlContext.implicits._

import org.apache.spark.sql._

 

//  Return the dataset specified by data source as a DataFrame, use the
header for column names

val df = sqlContext.load("com.databricks.spark.csv", Map("path" ->
"sfpd.csv", "header" -> "true"))

 

Now, I want to do the above as part of my package using spark-submit

spark-submit   --class "ServiceProject"   --master local[4]   --packages
com.databricks:spark-csv_2.10:1.0.3
target/scala-2.10/serviceproject_2.10-1.0.jar

 

Ivy Default Cache set to: /root/.ivy2/cache

The jars for the packages stored in: /root/.ivy2/jars

:: loading settings :: url =
jar:file:/usr/hdp/2.3.0.0-2557/spark/lib/spark-assembly-1.3.1.2.3.0.0-2557-h
adoop2.7.1.2.3.0.0-2557.jar!/org/apache/ivy/core/settings/ivysettings.xml

com.databricks#spark-csv_2.10 added as a dependency

:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0

confs: [default]

found com.databricks#spark-csv_2.10;1.0.3 in central

found org.apache.commons#commons-csv;1.1 in list

:: resolution report :: resolve 366ms :: artifacts dl 11ms

:: modules in use:

com.databricks#spark-csv_2.10;1.0.3 from central in [default]

org.apache.commons#commons-csv;1.1 from list in [default]

 
-

|  |modules||   artifacts
|

|   conf   | number| search|dwnlded|evicted||
number|dwnlded|

 
-

|  default |   2   |   0   |   0   |   0   ||   2   |   0
|

 
-

:: retrieving :: org.apache.spark#spark-submit-parent

confs: [default]

0 artifacts copied, 2 already retrieved (0kB/19ms)

Error: application failed with exception

java.lang.ArrayIndexOutOfBoundsException: 0

at ServiceProject$.main(ServiceProject.scala:29)

at ServiceProject.main(ServiceProject.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57
)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$ru
nMain(SparkSubmit.scala:577)

at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)

at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 

 

 

What am I doing wrong?

 

Subhajit

 



which install package type for cassandra use

2015-09-11 Thread beakesland
Hello,

Which install package type is suggested for adding Spark nodes to an existing
Cassandra cluster?
I will be using it to work with data already stored in Cassandra via the connector.
I am not currently running any Hadoop/CDH.

Thank you.

Phil



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/which-install-package-type-for-cassandra-use-tp24662.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
Thanks. I'm surprised to see so much difference (4x); there
could be something wrong in Spark (some contention between tasks).

On Fri, Sep 11, 2015 at 11:47 AM, Jesse F Chen  wrote:
>
> @Davies...good question..
>
> > Just be curious how the difference would be if you use 20 executors
> > and 20G memory for each executor..
>
> So I tried the following combinations:
>
> (GB X # executors)  (query response time in secs)
> 20X20 415
> 10X40 230
> 5X80 141
> 4X100 128
> 2X200 104
>
> CPU utilization is high so spreading more JVMs onto more vCores helps in this 
> case.
> For other workloads where memory utilization outweighs CPU, i can see larger 
> JVM
> sizes maybe more beneficial. It's for sure case-by-case.
>
> Seems overhead for codegen and scheduler overhead are negligible.
>
>
>
> Davies Liu ---09/11/2015 10:41:23 AM---On Fri, Sep 11, 2015 at 10:31 AM, 
> Jesse F Chen  wrote: >
>
> From: Davies Liu 
> To: Jesse F Chen/San Francisco/IBM@IBMUS
> Cc: "Cheng, Hao" , Todd , Michael 
> Armbrust , "user@spark.apache.org" 
> 
> Date: 09/11/2015 10:41 AM
> Subject: Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
> 
>
>
>
> On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen  wrote:
> >
> > Thanks Hao!
> >
> > I tried your suggestion of setting 
> > spark.shuffle.reduceLocality.enabled=false and my initial tests showed 
> > queries are on par between 1.5 and 1.4.1.
> >
> > Results:
> >
> > tpcds-query39b-141.out:query time: 129.106478631 sec
> > tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
> > tpcds-query39b-150.out:query time: 572.443151734 sec
> >
> > With default  spark.shuffle.reduceLocality.enabled=true, I am seeing 
> > across-the-board slow down for majority of the TPCDS queries.
> >
> > My test is on a bare metal 20-node cluster. I ran the my test as follows:
> >
> > /TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client  
> > --packages com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
> > --conf spark.shuffle.reduceLocality.enabled=false
> > --executor-memory 4096m --num-executors 100
> > --class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
> > /TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar
> > hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
> > /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql
> >
>
> Just be curious how the difference would be if you use 20 executors
> and 20G memory for each executor. Share the same JVM for some tasks,
> could reduce the overhead for codegen and JIT, it may also reduce the
> overhead of `reduceLocality`(it can be easier to schedule the tasks).
>
> >
> >
> >
> > "Cheng, Hao" ---09/11/2015 01:00:28 AM---Can you confirm if the query 
> > really run in the cluster mode? Not the local mode. Can you print the c
> >
> > From: "Cheng, Hao" 
> > To: Todd 
> > Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust 
> > , "user@spark.apache.org" 
> > Date: 09/11/2015 01:00 AM
> > Subject: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> > compared with spark 1.4.1 SQL
> >
> > 
> >
> >
> >
> > Can you confirm if the query really run in the cluster mode? Not the local 
> > mode. Can you print the call stack of the executor when the query is 
> > running?
> >
> > BTW: spark.shuffle.reduceLocality.enabled is the configuration of Spark, 
> > not Spark SQL.
> >
> > From: Todd [mailto:bit1...@163.com]
> > Sent: Friday, September 11, 2015 3:39 PM
> > To: Todd
> > Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> > Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> > compared with spark 1.4.1 SQL
> >
> > I add the following two options:
> > spark.sql.planner.sortMergeJoin=false
> > spark.shuffle.reduceLocality.enabled=false
> >
> > But it still performs the same as not setting them two.
> >
> > One thing is that on the spark ui, when I click the SQL tab, it shows an 
> > empty page but the header title 'SQL',there is no table to show queries and 
> > execution plan information.
> >
> >
> >
> >
> > At 2015-09-11 14:39:06, "Todd"  wrote:
> >
> >
> > Thanks Hao.
> > Yes,it is still low as SMJ。Let me try the option your suggested,
> >
> >
> > At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:
> >
> > You mean the performance is still slow as the SMJ in Spark 1.5?
> >
> > Can you set the spark.shuffle.reduceLocality.enabled=false when you start 
> > the spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true 
> > by default, but we found it probably causes the 

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen

@Davies...good question..

> Just be curious how the difference would be if you use 20 executors
> and 20G memory for each executor..

So I tried the following combinations:

(GB X # executors)   (query response time in secs)
20 X 20              415
10 X 40              230
 5 X 80              141
 4 X 100             128
 2 X 200             104

CPU utilization is high, so spreading more JVMs onto more vCores helps in
this case.
For other workloads where memory utilization outweighs CPU, I can see
larger JVM sizes being more beneficial. It's case-by-case for sure.

The codegen and scheduler overheads seem negligible.


  

  

  






From:   Davies Liu 
To: Jesse F Chen/San Francisco/IBM@IBMUS
Cc: "Cheng, Hao" , Todd ,
Michael Armbrust ,
"user@spark.apache.org" 
Date:   09/11/2015 10:41 AM
Subject:Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by
50%+ compared with spark 1.4.1 SQL



On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen  wrote:
>
> Thanks Hao!
>
> I tried your suggestion of setting
spark.shuffle.reduceLocality.enabled=false and my initial tests showed
queries are on par between 1.5 and 1.4.1.
>
> Results:
>
> tpcds-query39b-141.out:query time: 129.106478631 sec
> tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
> tpcds-query39b-150.out:query time: 572.443151734 sec
>
> With default  spark.shuffle.reduceLocality.enabled=true, I am seeing
across-the-board slow down for majority of the TPCDS queries.
>
> My test is on a bare metal 20-node cluster. I ran the my test as follows:
>
> /TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client
--packages com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
> --conf spark.shuffle.reduceLocality.enabled=false
> --executor-memory 4096m --num-executors 100
> --class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
> /TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar

> hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
> /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql

>

Just be curious how the difference would be if you use 20 executors
and 20G memory for each executor. Share the same JVM for some tasks,
could reduce the overhead for codegen and JIT, it may also reduce the
overhead of `reduceLocality`(it can be easier to schedule the tasks).

>
>
>
> "Cheng, Hao" ---09/11/2015 01:00:28 AM---Can you confirm if the query
really run in the cluster mode? Not the local mode. Can you print the c
>
> From: "Cheng, Hao" 
> To: Todd 
> Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust
, "user@spark.apache.org" 
> Date: 09/11/2015 01:00 AM
> Subject: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by
50%+ compared with spark 1.4.1 SQL
>
> 
>
>
>
> Can you confirm if the query really run in the cluster mode? Not the
local mode. Can you print the call stack of the executor when the query is
running?
>
> BTW: spark.shuffle.reduceLocality.enabled is the configuration of Spark,
not Spark SQL.
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+
compared with spark 1.4.1 SQL
>
> I add the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as not setting them two.
>
> One thing is that on the spark ui, when I click the SQL tab, it shows an
empty page but the header title 'SQL',there is no table to show queries and
execution plan information.
>
>
>
>
> At 2015-09-11 14:39:06, "Todd"  wrote:
>
>
> Thanks Hao.
> Yes,it is still low as SMJ。Let me try the option your suggested,
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:
>
> You mean the performance is still slow as the SMJ in Spark 1.5?
>
> Can you set the spark.shuffle.reduceLocality.enabled=false when you start
the spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true
by default, but we found it probably causes the performance reduce
dramatically.
>
>
> From: Todd 

New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Rachana Srivastava
Hello all,

Can we invoke JavaRDD while processing stream from Kafka for example.  
Following code is throwing some serialization exception.  Not sure if this is 
feasible.

  JavaStreamingContext jssc = new JavaStreamingContext(jsc, 
Durations.seconds(5));
JavaPairReceiverInputDStream messages = 
KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);
JavaDStream lines = messages.map(new Function, String>() {
  public String call(Tuple2 tuple2) { return tuple2._2();
  }
});
JavaPairDStream wordCounts = lines.mapToPair( new 
PairFunction() {
public Tuple2 call(String urlString) {
String propertiesFile = 
"/home/cloudera/Desktop/sample/input/featurelist.properties";
JavaRDD propertiesFileRDD = 
jsc.textFile(propertiesFile);
  JavaPairRDD featureKeyClassPair = 
propertiesFileRDD.mapToPair(
  new PairFunction() {
  public Tuple2 
call(String property) {
return new 
Tuple2(property.split("=")[0], property.split("=")[1]);
  }
 });
featureKeyClassPair.count();
  return new Tuple2(urlString,  featureScore);
}
  });



Re: java.util.NoSuchElementException: key not found

2015-09-11 Thread Yin Huai
Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422, it
has been fixed in branch 1.5. 1.5.1 release will have it.

On Fri, Sep 11, 2015 at 3:35 AM, guoqing0...@yahoo.com.hk <
guoqing0...@yahoo.com.hk> wrote:

> Hi all ,
> After upgrade spark to 1.5 ,  Streaming throw
> java.util.NoSuchElementException: key not found occasionally , is the
> problem of data cause this error ?  please help me if anyone got similar
> problem before , Thanks very much.
>
> the exception accur when write into database.
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 
> (TID 76, slave2): java.util.NoSuchElementException: key not found: 
> ruixue.sys.session.request
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>
> at 
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>
> at 
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>
> at 
> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
>
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
>
> at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>
> at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>
> at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> --
> guoqing0...@yahoo.com.hk
>


Re: Training the MultilayerPerceptronClassifier

2015-09-11 Thread Feynman Liang
Rory,

I just sent a PR (https://github.com/avulanov/ann-benchmark/pull/1) to
bring that benchmark up to date. Hope it helps.

On Fri, Sep 11, 2015 at 6:39 AM, Rory Waite  wrote:

> Hi,
>
> I’ve been trying to train the new MultilayerPerceptronClassifier in spark
> 1.5 for the MNIST digit recognition task. I’m trying to reproduce the work
> here:
>
> https://github.com/avulanov/ann-benchmark
>
> The API has changed since this work, so I’m not sure that I’m setting up
> the task correctly.
>
> After I've trained the classifier, it classifies everything as a 1. It
> even does this for the training set. I am doing something wrong with the
> setup? I’m not looking for state of the art performance, just something
> that looks reasonable. This experiment is meant to be a quick sanity test.
>
> Here is the job:
>
> import org.apache.log4j._
> //Logger.getRootLogger.setLevel(Level.OFF)
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.PipelineStage
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SQLContext
> import java.io.FileOutputStream
> import java.io.ObjectOutputStream
>
> object MNIST {
>  def main(args: Array[String]) {
>val conf = new SparkConf().setAppName("MNIST")
>conf.set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512M")
>val sc = new SparkContext(conf)
>val batchSize = 100
>val numIterations = 5
>val mlp = new MultilayerPerceptronClassifier
>mlp.setLayers(Array[Int](780, 2500, 2000, 1500, 1000, 500, 10))
>mlp.setMaxIter(numIterations)
>mlp.setBlockSize(batchSize)
>val train = MLUtils.loadLibSVMFile(sc,
> "file:///misc/home/rwaite/mt-work/ann-benchmark/mnist.scale")
>train.repartition(200)
>val sqlContext = new SQLContext(sc)
>import sqlContext.implicits._
>val df = train.toDF
>val model = mlp.fit(df)
>val trainPredictions = model.transform(df)
>trainPredictions.show(100)
>val test = MLUtils.loadLibSVMFile(sc,
> "file:///misc/home/rwaite/mt-work/ann-benchmark/mnist.scale.t", 780).toDF
>val result = model.transform(test)
>result.show(100)
>val predictionAndLabels = result.select("prediction", "label")
>val evaluator = new MulticlassClassificationEvaluator()
>  .setMetricName("precision")
>println("Precision:" + evaluator.evaluate(predictionAndLabels))
>val fos = new
> FileOutputStream("/home/rwaite/mt-work/ann-benchmark/spark_out/spark_model.obj");
>val oos = new ObjectOutputStream(fos);
>oos.writeObject(model);
>oos.close
>  }
> }
>
>
> And here is the output:
>
> +-++--+
> |label|features|prediction|
> +-++--+
> |  5.0|(780,[152,153,154...|   1.0|
> |  0.0|(780,[127,128,129...|   1.0|
> |  4.0|(780,[160,161,162...|   1.0|
> |  1.0|(780,[158,159,160...|   1.0|
> |  9.0|(780,[208,209,210...|   1.0|
> |  2.0|(780,[155,156,157...|   1.0|
> |  1.0|(780,[124,125,126...|   1.0|
> |  3.0|(780,[151,152,153...|   1.0|
> |  1.0|(780,[152,153,154...|   1.0|
> |  4.0|(780,[134,135,161...|   1.0|
> |  3.0|(780,[123,124,125...|   1.0|
> |  5.0|(780,[216,217,218...|   1.0|
> |  3.0|(780,[143,144,145...|   1.0|
> |  6.0|(780,[72,73,74,99...|   1.0|
> |  1.0|(780,[151,152,153...|   1.0|
> |  7.0|(780,[211,212,213...|   1.0|
> |  2.0|(780,[151,152,153...|   1.0|
> |  8.0|(780,[159,160,161...|   1.0|
> |  6.0|(780,[100,101,102...|   1.0|
> |  9.0|(780,[209,210,211...|   1.0|
> |  4.0|(780,[129,130,131...|   1.0|
> |  0.0|(780,[129,130,131...|   1.0|
> |  9.0|(780,[183,184,185...|   1.0|
> |  1.0|(780,[158,159,160...|   1.0|
> |  1.0|(780,[99,100,101,...|   1.0|
> |  2.0|(780,[124,125,126...|   1.0|
> |  4.0|(780,[185,186,187...|   1.0|
> |  3.0|(780,[150,151,152...|   1.0|
> |  2.0|(780,[145,146,147...|   1.0|
> |  7.0|(780,[240,241,242...|   1.0|
> |  3.0|(780,[201,202,203...|   1.0|
> |  8.0|(780,[153,154,155...|   1.0|
> |  6.0|(780,[71,72,73,74...|   1.0|
> |  9.0|(780,[210,211,212...|   1.0|
> |  0.0|(780,[154,155,156...|   1.0|
> |  5.0|(780,[188,189,190...|   1.0|
> |  6.0|(780,[98,99,100,1...|   1.0|
> |  0.0|(780,[127,128,129...|   1.0|
> |  7.0|(780,[201,202,203...|   1.0|
> |  6.0|(780,[125,126,127...|   1.0|
> |  1.0|(780,[154,155,156...|   1.0|
> |  8.0|(780,[131,132,133...|   1.0|
> |  7.0|(780,[209,210,211...|   1.0|
> |  9.0|(780,[181,182,183...|   1.0|
> |  3.0|(780,[174,175,176...|   1.0|
> |  

countApproxDistinctByKey in python

2015-09-11 Thread LucaMartinetti
Hi,

I am trying to use countApproxDistinctByKey in pyspark but cannot find it. 

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L417

Am I missing something, or has it not been ported/wrapped yet?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/countApproxDistinctByKey-in-python-tp24663.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-11 Thread Varadhan, Jawahar
Hi all,

I have coded a custom receiver which receives Kafka messages. These Kafka
messages have FTP server credentials in them. The receiver opens the message
and uses the FTP credentials in it to connect to the FTP server. It then
streams a huge text file (3.3 GB). Finally, this stream is read line by line
using a buffered reader and pushed to Spark Streaming via the receiver's
"store" method. The Spark Streaming process receives all these lines and
stores them in HDFS.

With this process I could ingest small files (50 MB), but I can't ingest the
3.3 GB file. I get a YARN exception of SIGTERM 15 in the Spark Streaming
process. Also, I tried pointing Spark Streaming at that 3.3 GB file directly
(without the custom receiver) using ssc.textFileStream, and everything works
fine and the file ends up in HDFS.

Please let me know what I might have to do to get this working with the
receiver. I know there are better ways to ingest the file, but we need to use
Spark Streaming in our case.

Thanks.

UserDefinedTypes

2015-09-11 Thread Richard Eggert
Greetings,

I have recently started using Spark SQL and ran up against two rather odd
limitations related to UserDefinedTypes.

The first is that there appears to be no way to register a UserDefinedType
other than by adding the @SQLUserDefinedType annotation to the class being
mapped. This makes it impossible to define mappings for classes that the
developer does not control. In my case, I wanted to be able to map the
Instant class from JSR-310 (more specifically, the Java 7 backport) as a
replacement for java.sql.Timestamp. To my dismay, I discovered that Spark
SQL provides no way to do this, as far as I can tell.

Secondly, it seems that Spark SQL absolutely refuses to allow ordering to
be applied to UserDefinedTypes, even when the types are mapped to built-in
types that are themselves ordered. At first I thought that perhaps I had
overlooked the way to do this, but when I looked in the source code I
found, to my despair, that the code that implements ordering operations
specifically checks that the type extends AtomicType, and bails out with an
exception if it does not.

It seems like it should be relatively straightforward to implement both of
these features. Are there currently any plans to do so?

Note that I am using Spark 1.4.0, but I haven't seen any indication that
either of these are addressed by 1.4.1 or 1.5.0.

Thanks.

Rich Eggert


Re: MongoDB and Spark

2015-09-11 Thread Sandeep Giri
I think it should be possible by loading the collections as RDDs and then doing
a union on them.
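
A rough Scala sketch of that idea, assuming the mongo-hadoop connector
(com.mongodb.hadoop.MongoInputFormat); the URIs, database, and collection names
are placeholders, and each collection gets its own copy of one shared base
Configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

def collectionRDD(sc: SparkContext, base: Configuration, uri: String) = {
  val conf = new Configuration(base)   // copy the shared settings
  conf.set("mongo.input.uri", uri)     // collection-specific input URI
  sc.newAPIHadoopRDD(conf, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
}

// sc is an existing SparkContext.
val base   = new Configuration()       // shared settings go here once
val users  = collectionRDD(sc, base, "mongodb://host:27017/mydb.users")
val orders = collectionRDD(sc, base, "mongodb://host:27017/mydb.orders")
val all    = users.union(orders)       // both collections as one RDD of (Object, BSONObject)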

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. 
Phone: +1-253-397-1945 (Office)



On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek  wrote:

> Anything using Spark RDD’s ???
>
>
>
> Abhishek
>
>
>
> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
> *Sent:* Friday, September 11, 2015 3:19 PM
> *To:* Mishra, Abhishek; user@spark.apache.org; d...@spark.apache.org
> *Subject:* Re: MongoDB and Spark
>
>
>
> use map-reduce.
>
>
>
> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
> wrote:
>
> Hello ,
>
>
>
> Is there any way to query multiple collections from mongodb using spark
> and java.  And i want to create only one Configuration Object. Please help
> if anyone has something regarding this.
>
>
>
>
>
> Thank You
>
> Abhishek
>
>


Training the MultilayerPerceptronClassifier

2015-09-11 Thread Rory Waite
Hi,

I’ve been trying to train the new MultilayerPerceptronClassifier in spark 1.5 
for the MNIST digit recognition task. I’m trying to reproduce the work here:

https://github.com/avulanov/ann-benchmark

The API has changed since this work, so I’m not sure that I’m setting up the 
task correctly.

After I've trained the classifier, it classifies everything as a 1. It even 
does this for the training set. I am doing something wrong with the setup? I’m 
not looking for state of the art performance, just something that looks 
reasonable. This experiment is meant to be a quick sanity test.

Here is the job:

import org.apache.log4j._
//Logger.getRootLogger.setLevel(Level.OFF)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.PipelineStage
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import java.io.FileOutputStream
import java.io.ObjectOutputStream

object MNIST {
 def main(args: Array[String]) {
   val conf = new SparkConf().setAppName("MNIST")
   conf.set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512M")
   val sc = new SparkContext(conf)
   val batchSize = 100
   val numIterations = 5
   val mlp = new MultilayerPerceptronClassifier
   mlp.setLayers(Array[Int](780, 2500, 2000, 1500, 1000, 500, 10))
   mlp.setMaxIter(numIterations)
   mlp.setBlockSize(batchSize)
   val train = MLUtils.loadLibSVMFile(sc, 
"file:///misc/home/rwaite/mt-work/ann-benchmark/mnist.scale")
   train.repartition(200)
   val sqlContext = new SQLContext(sc)
   import sqlContext.implicits._
   val df = train.toDF
   val model = mlp.fit(df)
   val trainPredictions = model.transform(df)
   trainPredictions.show(100)
   val test = MLUtils.loadLibSVMFile(sc, 
"file:///misc/home/rwaite/mt-work/ann-benchmark/mnist.scale.t", 780).toDF
   val result = model.transform(test)
   result.show(100)
   val predictionAndLabels = result.select("prediction", "label")
   val evaluator = new MulticlassClassificationEvaluator()
 .setMetricName("precision")
   println("Precision:" + evaluator.evaluate(predictionAndLabels))
   val fos = new 
FileOutputStream("/home/rwaite/mt-work/ann-benchmark/spark_out/spark_model.obj");
   val oos = new ObjectOutputStream(fos);
   oos.writeObject(model);
   oos.close
 }
}


And here is the output:

+-++--+
|label|features|prediction|
+-++--+
|  5.0|(780,[152,153,154...|   1.0|
|  0.0|(780,[127,128,129...|   1.0|
|  4.0|(780,[160,161,162...|   1.0|
|  1.0|(780,[158,159,160...|   1.0|
|  9.0|(780,[208,209,210...|   1.0|
|  2.0|(780,[155,156,157...|   1.0|
|  1.0|(780,[124,125,126...|   1.0|
|  3.0|(780,[151,152,153...|   1.0|
|  1.0|(780,[152,153,154...|   1.0|
|  4.0|(780,[134,135,161...|   1.0|
|  3.0|(780,[123,124,125...|   1.0|
|  5.0|(780,[216,217,218...|   1.0|
|  3.0|(780,[143,144,145...|   1.0|
|  6.0|(780,[72,73,74,99...|   1.0|
|  1.0|(780,[151,152,153...|   1.0|
|  7.0|(780,[211,212,213...|   1.0|
|  2.0|(780,[151,152,153...|   1.0|
|  8.0|(780,[159,160,161...|   1.0|
|  6.0|(780,[100,101,102...|   1.0|
|  9.0|(780,[209,210,211...|   1.0|
|  4.0|(780,[129,130,131...|   1.0|
|  0.0|(780,[129,130,131...|   1.0|
|  9.0|(780,[183,184,185...|   1.0|
|  1.0|(780,[158,159,160...|   1.0|
|  1.0|(780,[99,100,101,...|   1.0|
|  2.0|(780,[124,125,126...|   1.0|
|  4.0|(780,[185,186,187...|   1.0|
|  3.0|(780,[150,151,152...|   1.0|
|  2.0|(780,[145,146,147...|   1.0|
|  7.0|(780,[240,241,242...|   1.0|
|  3.0|(780,[201,202,203...|   1.0|
|  8.0|(780,[153,154,155...|   1.0|
|  6.0|(780,[71,72,73,74...|   1.0|
|  9.0|(780,[210,211,212...|   1.0|
|  0.0|(780,[154,155,156...|   1.0|
|  5.0|(780,[188,189,190...|   1.0|
|  6.0|(780,[98,99,100,1...|   1.0|
|  0.0|(780,[127,128,129...|   1.0|
|  7.0|(780,[201,202,203...|   1.0|
|  6.0|(780,[125,126,127...|   1.0|
|  1.0|(780,[154,155,156...|   1.0|
|  8.0|(780,[131,132,133...|   1.0|
|  7.0|(780,[209,210,211...|   1.0|
|  9.0|(780,[181,182,183...|   1.0|
|  3.0|(780,[174,175,176...|   1.0|
|  9.0|(780,[208,209,210...|   1.0|
|  8.0|(780,[152,153,154...|   1.0|
|  5.0|(780,[186,187,188...|   1.0|
|  9.0|(780,[150,151,152...|   1.0|
|  3.0|(780,[152,153,154...|   1.0|
|  3.0|(780,[122,123,124...|   1.0|
|  0.0|(780,[153,154,155...|   1.0|
|  7.0|(780,[203,204,205...|   1.0|
|  4.0|(780,[212,213,214...|   1.0|
|  9.0|(780,[205,206,207...|   1.0|
|  8.0|(780,[181,182,183...|   1.0|
|  

Re: Multilabel classification support

2015-09-11 Thread Yanbo Liang
LogisticRegression in the MLlib (not ML) package supports both multiclass
and multilabel classification.
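
For reference, a minimal Scala sketch of the multiclass usage referred to
above, assuming LogisticRegressionWithLBFGS from the MLlib package (the data
path and the number of classes are placeholders; handling truly multilabel
data, e.g. training one binary model per label, is not shown here):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// sc is an existing SparkContext; the LIBSVM path is a placeholder.
val data = MLUtils.loadLibSVMFile(sc, "data/multiclass_sample.txt")

// Multinomial (multiclass) logistic regression.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)   // placeholder: number of label classes in the data
  .run(data)

// Pair each prediction with the true label for evaluation.
val predictionAndLabels = data.map(p => (model.predict(p.features), p.label))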


2015-09-11 16:21 GMT+08:00 Alexis Gillain :

> You can try these packages for adaboost.mh :
>
> https://github.com/BaiGang/spark_multiboost (scala)
> or
> https://github.com/tizfa/sparkboost (java)
>
>
> 2015-09-11 15:29 GMT+08:00 Yasemin Kaya :
>
>> Hi,
>>
>> I want to use Mllib for multilabel classification, but I find
>> http://spark.apache.org/docs/latest/mllib-classification-regression.html,
>> it is not what I mean. Is there a way to use  multilabel classification?
>> Thanks alot.
>>
>> Best,
>> yasemin
>>
>> --
>> hiç ender hiç
>>
>
>
>
> --
> Alexis GILLAIN
>


Re: MLlib LDA implementation questions

2015-09-11 Thread Carsten Schnober
Hi,
I don't have practical experience with the MLlib LDA implementation, but
regarding the variations in the topic matrix: LDA makes use of stochastic
processes. If you use setSeed(seed) with the same value for seed during
initialization, though, your results should be identical.
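
A minimal Scala sketch of fixing the seed, assuming the MLlib LDA API (the
number of topics and the corpus are placeholders):

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: RDD of (document id, term-count vector), prepared elsewhere.
def trainLda(corpus: RDD[(Long, Vector)]) = {
  new LDA()
    .setK(10)       // placeholder: number of topics
    .setSeed(42L)   // same seed + same input should give the same topic matrix
    .run(corpus)
}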

May I ask what exactly you refer to with prediction? Topic assignments
(inference)?

Best,
Carsten


Am 11.09.2015 um 15:29 schrieb Marko Asplund:
> Hi,
> 
> We're considering using Spark MLlib (v >= 1.5) LDA implementation for
> topic modelling. We plan to train the model using a data set of about 12
> M documents and vocabulary size of 200-300 k items. Documents are
> relatively short, typically containing less than 10 words, but the
> number can range up to tens of words. The model would be updated
> periodically using e.g. a batch process while predictions will be
> queried by a long-running application process in which we plan to embed
> MLlib.
> 
> Is the MLlib LDA implementation considered to be well-suited to this
> kind of use case?
> 
> I did some prototyping based on the code samples on "MLlib - Clustering"
> page and noticed that the topics matrix values seem to vary quite a bit
> across training runs even with the exact same input data set. During
> prediction I observed similar behaviour.
> Is this due to the probabilistic nature of the LDA algorithm?
> 
> Any caveats to be aware of with the LDA implementation?
> 
> For reference, my prototype code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala
> 
> 
> thanks,
> marko

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is it required to remove checkpoint when submitting a code change?

2015-09-11 Thread Cody Koeninger
Yeah, it makes sense that parameters that are read only during your
getOrCreate function wouldn't be re-read, since that function isn't called
if a checkpoint is loaded.

I would have thought changing number of executors and other things used by
spark-submit would work on checkpoint restart.  Have you tried both
changing them in the properties file provided to spark submit, and the
--arguments that correspond to number of cores / executor memory?

On Thu, Sep 10, 2015 at 5:23 PM, Ricardo Luis Silva Paiva <
ricardo.pa...@corp.globo.com> wrote:

>
> Hi guys,
>
> I tried to use the configuration file, but it didn't work as I expected.
> As part of the Spark Streaming flow, my methods run only when the
> application is started the first time. Once I restart the app, it reads
> from the checkpoint and all the dstream operations come from the cache. No
> parameter is reloaded.
>
> I would like to know if it's possible to reset the time of windowed
> operations, checkpoint time etc. I also would like to change the submission
> parameters, like number of executors, memory per executor or driver etc. If
> it's not possible, what kind of parameters do you guys usually use in a
> configuration file. I know that the streaming interval it not possible to
> be changed.
>
> This is my code:
>
> def main(args: Array[String]): Unit = {
>   val ssc = StreamingContext.getOrCreate(CHECKPOINT_FOLDER,
> createSparkContext _)
>   ssc.start()
>   ssc.awaitTermination()
>   ssc.stop()
> }
>
> def createSparkContext(): StreamingContext = {
>   val sparkConf = new SparkConf()
>  .setAppName(APP_NAME)
>  .set("spark.streaming.unpersist", "true")
>   val ssc = new StreamingContext(sparkConf, streamingInterval)
>   ssc.checkpoint(CHECKPOINT_FOLDER)
>   ssc.sparkContext.addFile(CONFIG_FILENAME)
>
>   val rawStream = createKafkaRDD(ssc)
>   processAndSave(rawStream)
>   return ssc
> }
>
> def processAndSave(rawStream:DStream[(String, Array[Byte])]): Unit = {
>
>   val configFile = SparkFiles.get("config.properties")
>   val config:Config = ConfigFactory.parseFile(new File(configFile))
>
>
> *  slidingInterval = Minutes(config.getInt("streaming.sliding.interval"))
> windowLength = Minutes(config.getInt("streaming.window.interval"))
> minPageview = config.getInt("streaming.pageview.min")*
>
>
>   val pageviewStream = rawStream.map{ case (_, raw) =>
> (PageViewParser.parseURL(raw), 1L) }
>   val pageviewsHourlyCount = 
> stream.reduceByKeyAndWindow(PageViewAgregator.pageviewsSum
> _,
>
> PageViewAgregator.pageviewsMinus _,
>  *windowLength*,
>  *slidingInterval*
> )
>
>   val permalinkAudienceStream = pageviewsHourlyCount.filter(_._2 >=
> *minPageview*)
>   permalinkAudienceStream.map(a => s"${a._1}\t${a._2}")
>  .repartition(1)
>  .saveAsTextFiles(DESTINATION_FILE, "txt")
>
> }
>
> I really appreciate any help on this.
>
> Many thanks,
>
> Ricardo
>
> On Thu, Sep 3, 2015 at 1:58 PM, Ricardo Luis Silva Paiva <
> ricardo.pa...@corp.globo.com> wrote:
>
>> Good tip. I will try that.
>>
>> Thank you.
>>
>> On Wed, Sep 2, 2015 at 6:54 PM, Cody Koeninger 
>> wrote:
>>
>>> Yeah, in general if you're changing the jar you can't recover the
>>> checkpoint.
>>>
>>> If you're just changing parameters, why not externalize those in a
>>> configuration file so your jar doesn't change?  I tend to stick even my
>>> app-specific parameters in an external spark config so everything is in one
>>> place.
>>>
>>> On Wed, Sep 2, 2015 at 4:48 PM, Ricardo Luis Silva Paiva <
>>> ricardo.pa...@corp.globo.com> wrote:
>>>
 Hi,

 Is there a way to submit an app code change, keeping the checkpoint
 data or do I need to erase the checkpoint folder every time I re-submit the
 spark app with a new jar?

 I have an app that count pageviews streaming from Kafka, and deliver a
 file every hour from the past 24 hours. I'm using reduceByKeyAndWindow with
 the reduce and inverse functions set.

 I'm doing some code improvements and would like to keep the data from
 the past hours, so when I re-submit a code change, I would keep delivering
 the pageviews aggregation without need to wait for 24 hours of new data.
 Sometimes I'm just changing the submission parameters, like number of
 executors, memory and cores.

 Many thanks,

 Ricardo

 --
 Ricardo Paiva
 Big Data / Semântica
 *globo.com* 

>>>
>>>
>>
>>
>> --
>> Ricardo Paiva
>> Big Data / Semântica
>> *globo.com* 
>>
>
>
>
> --
> Ricardo Paiva
> Big Data / Semântica
> *globo.com* 
>


Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Feynman Liang
Spark Notebook does something similar; take a look at their line chart code.


On Fri, Sep 11, 2015 at 8:56 AM, Shashi Vishwakarma <
shashi.vish...@gmail.com> wrote:

> Hi
>
> I have got streaming data which needs to be processed and send for
> visualization.  I am planning to use Spark Streaming for this but am a little
> bit confused about choosing a visualization tool. I read somewhere that D3.js
> can be used, but I wanted to know which is the best tool for visualization when
> dealing with a streaming application (something that can be easily integrated).
>
> If someone has any link which can tell about D3.js(or any other
> visualization tool) and Spark streaming application integration  then
> please share . That would be great help.
>
>
> Thanks and Regards
> Shashi
>
>


Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen  wrote:
>
> Thanks Hao!
>
> I tried your suggestion of setting spark.shuffle.reduceLocality.enabled=false 
> and my initial tests showed queries are on par between 1.5 and 1.4.1.
>
> Results:
>
> tpcds-query39b-141.out:query time: 129.106478631 sec
> tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
> tpcds-query39b-150.out:query time: 572.443151734 sec
>
> With default  spark.shuffle.reduceLocality.enabled=true, I am seeing 
> across-the-board slow down for majority of the TPCDS queries.
>
> My test is on a bare metal 20-node cluster. I ran the my test as follows:
>
> /TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client  --packages 
> com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
> --conf spark.shuffle.reduceLocality.enabled=false
> --executor-memory 4096m --num-executors 100
> --class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
> /TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar
> hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
> /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql
>

Just curious how the results would differ if you used 20 executors
with 20G of memory each. Sharing the same JVM across more tasks
could reduce the overhead of codegen and JIT, and it may also reduce the
overhead of `reduceLocality` (it can be easier to schedule the tasks).
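
For example, the spark-submit line above could be adjusted along these lines (a sketch only, keeping reduceLocality disabled and using the 20 x 20G sizing suggested here; --executor-cores is an arbitrary choice):

/TestAutomation/spark-1.5/bin/spark-submit --master yarn-client \
  --conf spark.shuffle.reduceLocality.enabled=false \
  --num-executors 20 --executor-memory 20g --executor-cores 5 \
  ... (remaining flags, class, jar and arguments as in the original command)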

>
>
>
> "Cheng, Hao" ---09/11/2015 01:00:28 AM---Can you confirm if the query really 
> run in the cluster mode? Not the local mode. Can you print the c
>
> From: "Cheng, Hao" 
> To: Todd 
> Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust 
> , "user@spark.apache.org" 
> Date: 09/11/2015 01:00 AM
> Subject: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
> 
>
>
>
> Can you confirm whether the query really runs in cluster mode, not local
> mode? Can you print the call stack of the executor while the query is running?
>
> BTW: spark.shuffle.reduceLocality.enabled is the configuration of Spark, not 
> Spark SQL.
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
> I add the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as not setting them two.
>
> One thing is that on the Spark UI, when I click the SQL tab, it shows an
> empty page with only the header title 'SQL'; there is no table showing queries and
> execution plan information.
>
>
>
>
> At 2015-09-11 14:39:06, "Todd"  wrote:
>
>
> Thanks Hao.
> Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:
>
> You mean the performance is still as slow as with the SMJ in Spark 1.5?
>
> Can you set spark.shuffle.reduceLocality.enabled=false when you start the
> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
> default, but we found it can cause performance to drop dramatically.
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 2:17 PM
> To: Cheng, Hao
> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
> Thanks Hao for the reply.
> I turn the merge sort join off, the physical plan is below, but the 
> performance is roughly the same as it on...
>
> == Physical Plan ==
> TungstenProject 
> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
> ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>  TungstenExchange hashpartitioning(ss_item_sk#2)
>   ConvertToUnsafe
>Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>  TungstenExchange hashpartitioning(ss_item_sk#25)
>   ConvertToUnsafe
>Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>
> Code Generation: true
>
>
>
> At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:
>
> This is not a big surprise the SMJ is slower than the HashJoin, as we do not 
> fully utilize the sorting yet, more details can be found at 
> https://issues.apache.org/jira/browse/SPARK-2926 .
>
> Anyway, can you disable the sort merge join by 
> “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query 
> again? In our previous 

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-11 Thread Tom Waterhouse (tomwater)
Tim,

Thank you for the explanation.  You are correct, my Mesos experience is very
light, and I haven’t deployed anything via Marathon yet.  What you have stated
here makes sense; I will look into doing this.

Adding this info to the docs would be great.  Is the appropriate action to
create an issue regarding improving the docs?  For those of us who are
gaining the experience, having such a pointer is very helpful.

Tom

From: Tim Chen
Date: Thursday, September 10, 2015 at 10:25 AM
To: Tom Waterhouse
Cc: "user@spark.apache.org"
Subject: Re: Spark on Mesos with Jobs in Cluster Mode Documentation

Hi Tom,

Sorry the documentation isn't really rich, since it probably assumes users
understand how Mesos and frameworks work.

First I need to explain the rationale for creating the dispatcher. If you're not
familiar with Mesos yet: each node in your datacenter has a Mesos slave installed,
which is responsible for publishing resources and running/watching tasks, and the
Mesos master is responsible for taking the aggregated resources and scheduling them
among frameworks.

Frameworks are not managed by Mesos, as the Mesos master/slave doesn't launch and
maintain frameworks but assumes they're launched and kept running on their own. All
the existing frameworks in the ecosystem therefore have their own ways to deploy,
handle HA, and persist state (e.g. Aurora, Marathon, etc).

Therefore, to introduce cluster mode with Mesos, we must create a long-running
framework that runs in your datacenter and can handle launching Spark drivers on
demand, HA, etc. This is what the dispatcher is all about.

So the idea is that you should launch the dispatcher not on the client, but on 
a machine in your datacenter. In Mesosphere's DCOS we launch all frameworks and 
long running services with Marathon, and you can use Marathon to launch the 
Spark dispatcher.

Then all clients, instead of specifying the Mesos master URL (e.g.
mesos://mesos.master:2181), just talk to the dispatcher
(mesos://spark-dispatcher.mesos:7077), and the dispatcher will then start and
watch the driver for you.
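
With the dispatcher running, submission from a client looks roughly like this (a minimal sketch; the dispatcher URL is the example hostname from above, the class and jar names are placeholders, and in cluster mode the jar must be at a location the cluster can reach, e.g. HDFS or HTTP):

spark-submit --deploy-mode cluster \
  --master mesos://spark-dispatcher.mesos:7077 \
  --class com.example.MyApp \
  hdfs:///apps/my-app.jar arg1 arg2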

Tim



On Thu, Sep 10, 2015 at 10:13 AM, Tom Waterhouse (tomwater) 
> wrote:
After spending most of yesterday scouring the Internet for sources of
documentation for submitting Spark jobs in cluster mode to a Spark cluster
managed by Mesos, I was able to do just that, but I am not convinced that the way
I have things set up is correct.

I used the Mesos published instructions for setting up my Mesos cluster.  I have three Zookeeper
instances, three Mesos master instances, and three Mesos slave instances.  This 
is all running in Openstack.

The documentation on the Spark documentation site states that “To use cluster 
mode, you must start the MesosClusterDispatcher in your cluster via the 
sbin/start-mesos-dispatcher.sh script, passing in the Mesos master url (e.g: 
mesos://host:5050).”  That is it, no more information than that.  So that is 
what I did: I have one machine that I use as the Spark client for submitting 
jobs.  I started the Mesos dispatcher with the script as described, and, using the
client machine’s IP address and port as the target for the job, submitted the job.

The job is currently running in Mesos as expected.  This is not, however, how I
would have expected to configure the system.  As it stands, there is one instance
of the Spark Mesos dispatcher running outside of Mesos, so it is not part of the
sphere of Mesos resource management.

I used the following Stack Overflow posts as guidelines:
http://stackoverflow.com/questions/31164725/spark-mesos-dispatcher
http://stackoverflow.com/questions/31294515/start-spark-via-mesos

There must be better documentation on how to deploy Spark in Mesos with jobs 
able to be deployed in cluster mode.

I can follow up with more specific information regarding my deployment if 
necessary.

Tom



Re: Implement "LIKE" in SparkSQL

2015-09-11 Thread Richard Eggert
concat and locate are available as of version 1.5.0, according to the
Scaladocs. For earlier versions of Spark, and for the operations that are
still not supported,  it's pretty straightforward to define your own
UserDefinedFunctions in either Scala or Java  (I don't know about other
languages).
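For example, the LIKE-with-concat query from the original message could be expressed with a registered UDF (a minimal sketch in Scala; the function name "contains_str" is arbitrary):

sqlContext.udf.register("contains_str",
  (a: String, b: String) => a != null && b != null && a.contains(b))

sqlContext.sql("select * from t where contains_str(t.a, t.b)")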
On Sep 11, 2015 10:26 PM, "liam"  wrote:

> Hi,
>
>  Imagine this: the value of one column is a substring of another
> column. When using Oracle, I have many ways to write the query, like the
> following statement, but how do I do it in Spark SQL, since there is no concat(),
> instr(), locate()...?
>
>
> select * from table t where t.a like '%'||t.b||'%';
>
>
> Thanks.
>
>


Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Ted Yu
You may have seen this:
https://spark.apache.org/docs/latest/sql-programming-guide.html

Please suggest what should be added.

Cheers

On Fri, Sep 11, 2015 at 3:43 AM, vivek bhaskar  wrote:

> Hi all,
>
> I am looking for a reference manual for Spark SQL some thing like many
> database vendors have. I could find one for hive ql
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual but not
> anything specific to spark sql.
>
> Please suggest. SQL reference specific to latest release will be of great
> help.
>
> Regards,
> Vivek
>
>


Re: Is there any Spark SQL reference manual?

2015-09-11 Thread vivek bhaskar
Hi Ted,

The link you mention does not have a complete list of supported syntax. For
example, some supported syntax is listed under "Supported Hive features", but
that does not claim to be exhaustive (even if it is, one has to filter out
many lines from the Hive QL reference and still will not be sure it covers
everything, due to version mismatches).

A quick online search gives me a link for another popular open source
project which has a good SQL reference:
https://db.apache.org/derby/docs/10.1/ref/crefsqlj23296.html.

I had a similar expectation when I was looking for all supported DDL and DML
syntax along with their extensions. For example,
a. Select expression along with supported extensions, i.e. where clause,
group by, the different supported joins, etc.
b. SQL format for Create, Insert, Alter table, etc.
c. SQL for Insert, Update, Delete, etc. along with their extensions.
d. Syntax for view creation, if supported
e. Syntax for the explain mechanism
f. List of supported functions, operators, etc. I can see that hundreds of
functions were added in 1.5, but then you have to do a lot of cross-checking from
code to JIRA tickets.

So I wanted a piece of documentation that provides all such information
in a single place.

Regards,
Vivek





On Fri, Sep 11, 2015 at 4:29 PM, Ted Yu  wrote:

> You may have seen this:
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Please suggest what should be added.
>
> Cheers
>
> On Fri, Sep 11, 2015 at 3:43 AM, vivek bhaskar 
> wrote:
>
>> Hi all,
>>
>> I am looking for a reference manual for Spark SQL some thing like many
>> database vendors have. I could find one for hive ql
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual but not
>> anything specific to spark sql.
>>
>> Please suggest. SQL reference specific to latest release will be of great
>> help.
>>
>> Regards,
>> Vivek
>>
>>
>


Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-11 Thread Ted Yu
Have you looked at:
https://issues.apache.org/jira/browse/SPARK-8013



> On Sep 11, 2015, at 4:53 AM, Petr Novak  wrote:
> 
> Does it still apply for 1.5.0?
> 
> What actual limitation does it mean when I switch to 2.11? No JDBC 
> Thriftserver? No JDBC DataSource? No JdbcRDD (which is already obsolete I 
> believe)? Some more?
> 
> What library is the blocker to upgrade JDBC component to 2.11?
> 
> Is there any estimate when it could be available for 2.11?
> 
> Many thanks,
> Petr

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



selecting columns with the same name in a join

2015-09-11 Thread Evert Lammerts
Am I overlooking something? This doesn't seem right:

x = sc.parallelize([dict(k=1, v="Evert"), dict(k=2, v="Erik")]).toDF()
y = sc.parallelize([dict(k=1, v="Ruud"), dict(k=3, v="Vincent")]).toDF()
x.registerTempTable('x')
y.registerTempTable('y')
sqlContext.sql("select y.v, x.v FROM x INNER JOIN y ON x.k=y.k").collect()

Out[26]: [Row(v=u'Evert', v=u'Evert')]

May just be because I'm behind; I'm on:

Spark 1.5.0-SNAPSHOT (git revision 27ef854) built for Hadoop 2.6.0 Build
flags: -Pyarn -Psparkr -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests

Can somebody check whether the above code does work on the latest release?

Thanks!
Evert
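
In the meantime, one way to avoid the ambiguous output columns is to alias the DataFrames and the selected columns explicitly (a sketch in Scala; the same pattern should apply in PySpark via DataFrame.alias and col, though whether it works around this particular resolution issue is untested here):

import org.apache.spark.sql.functions.col
import sqlContext.implicits._

val x = sc.parallelize(Seq((1, "Evert"), (2, "Erik"))).toDF("k", "v")
val y = sc.parallelize(Seq((1, "Ruud"), (3, "Vincent"))).toDF("k", "v")

x.as("x").join(y.as("y"), col("x.k") === col("y.k"))
  .select(col("y.v").as("y_v"), col("x.v").as("x_v"))
  .show()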


Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-11 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtPPuSvBu0rj2
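
In case PermGen is indeed what is being exhausted (an assumption; the linked thread discusses it), a common mitigation on pre-Java-8 JVMs is to raise the PermGen cap for the driver and executors, e.g. in spark-defaults.conf (the 256m value is only an example):

spark.driver.extraJavaOptions    -XX:MaxPermSize=256m
spark.executor.extraJavaOptions  -XX:MaxPermSize=256m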


> On Sep 11, 2015, at 3:00 AM, Jagat Singh  wrote:
> 
> Hi,
> 
> We have queries which were running fine on 1.4.1 system.
> 
> We are testing upgrade and even simple query like
> val t1= sqlContext.sql("select count(*) from table")
> 
> t1.show
> 
> This works perfectly fine on 1.4.1 but throws OOM error in 1.5.0
> 
> Are there any changes in default memory settings from 1.4.1 to 1.5.0
> 
> Thanks,
> 
> 
> 
> 


Is there any Spark SQL reference manual?

2015-09-11 Thread vivek bhaskar
Hi all,

I am looking for a reference manual for Spark SQL some thing like many
database vendors have. I could find one for hive ql
https://cwiki.apache.org/confluence/display/Hive/LanguageManual but not
anything specific to spark sql.

Please suggest. SQL reference specific to latest release will be of great
help.

Regards,
Vivek


Spark does not yet support its JDBC component for Scala 2.11.

2015-09-11 Thread Petr Novak
Does it still apply for 1.5.0?

What actual limitation does it mean when I switch to 2.11? No JDBC
Thriftserver? No JDBC DataSource? No JdbcRDD (which is already obsolete I
believe)? Some more?

What library is the blocker to upgrade JDBC component to 2.11?

Is there any estimate when it could be available for 2.11?

Many thanks,
Petr


java.util.NoSuchElementException: key not found

2015-09-11 Thread guoqing0...@yahoo.com.hk
Hi all,
After upgrading Spark to 1.5, Streaming occasionally throws java.util.NoSuchElementException:
key not found. Could the data be causing this error? Please
help if anyone has hit a similar problem before. Thanks very much.

The exception occurs when writing into the database.


org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 
76, slave2): java.util.NoSuchElementException: key not found: 
ruixue.sys.session.request
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:152)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:152)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)



guoqing0...@yahoo.com.hk


Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Slight update:

The following code with the "spark context" works with wildcard file paths in
hard-coded strings, but it won't work with a value parsed out of the program
arguments as above:

val sc = new SparkContext(sparkConf)
val zipFileTextRDD =
sc.textFile("/data/raw/logs/2015-09-01/home/logs/access_logs/*/2015-09-01/alog.2015-09-01-18-*.gz",
100)

zipFileTextRDD.map {
  case (text: String) => {
text
  }
}.saveAsTextFile("TextFileForGZips")

The code with the spark "streaming context" won't work with this hard-coded
path either.

val ssc = new StreamingContext(sparkConf, Seconds(2))

val zipFileDStreams =
ssc.textFileStream("/data/raw/logs/2015-09-01/home/logs/access_logs/*/2015-09-01/alog.2015-09-01-18-*.gz")
zipFileDStreams.foreachRDD {
  rdd =>
rdd.map {
  logLineText => {
  println(logLineText)
  logLineText
  //producerObj.value.send(topics, logLineText)
  }
}.saveAsTextFile("TextFileForGZips")
}

Not sure if the DStream created from textFileStream is the problem or
something else.

This is something I am not able to explain. Any clarity on what is
really happening inside would help me understand how this works so
that I don't make such a mistake again.

Regards,
Atul.
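
One thing worth checking (a note plus a minimal sketch; the landing directory below is hypothetical): as far as I recall, ssc.textFileStream monitors a single directory for files created after the stream starts, so a glob pattern that works for sc.textFile over existing files would not be expected to behave the same way.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(2))
// point the stream at a directory into which new files are written or moved atomically
val lines = ssc.textFileStream("/data/raw/logs/incoming")
lines.print()
ssc.start()
ssc.awaitTermination()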



On Fri, Sep 11, 2015 at 11:32 AM, Atul Kulkarni 
wrote:

> Folks,
>
> Any help on this?
>
> Regards,
> Atul.
>
>
> On Fri, Sep 11, 2015 at 8:39 AM, Atul Kulkarni 
> wrote:
>
>> Hi Raghavendra,
>>
>> Thanks for your answers, I am passing 10 executors and I am not sure if
>> that is the problem. It is still hung.
>>
>> Regards,
>> Atul.
>>
>>
>> On Fri, Sep 11, 2015 at 12:40 AM, Raghavendra Pandey <
>> raghavendra.pan...@gmail.com> wrote:
>>
>>> You can pass the number of executors via command line option
>>> --num-executors.You need more than 2 executors to make spark-streaming
>>> working.
>>>
>>> For more details on command line option, please go through
>>> http://spark.apache.org/docs/latest/running-on-yarn.html.
>>>
>>>
>>> On Fri, Sep 11, 2015 at 10:52 AM, Atul Kulkarni >> > wrote:
>>>
 I am submitting the job with yarn-cluster mode.

 spark-submit --master yarn-cluster ...

 On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey <
 raghavendra.pan...@gmail.com> wrote:

> What is the value of spark master conf.. By default it is local, that
> means only one thread can run and that is why your job is stuck.
> Specify it local[*], to make thread pool equal to number of cores...
>
> Raghav
> On Sep 11, 2015 6:06 AM, "Atul Kulkarni" 
> wrote:
>
>> Hi Folks,
>>
>> Below is the code I have for a Spark-based Kafka producer to take
>> advantage of multiple executors reading files in parallel on my cluster,
>> but
>> I am stuck: the program is not making any progress.
>>
>> Below is my scrubbed code:
>>
>> val sparkConf = new SparkConf().setAppName(applicationName)
>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>
>> val producerObj = ssc.sparkContext.broadcast(KafkaSink(kafkaProperties))
>>
>> val zipFileDStreams = ssc.textFileStream(inputFiles)
>> zipFileDStreams.foreachRDD {
>>   rdd =>
>> rdd.foreachPartition(
>>   partition => {
>> partition.foreach{
>>   case (logLineText) =>
>> println(logLineText)
>> producerObj.value.send(topics, logLineText)
>> }
>>   }
>> )
>> }
>>
>> ssc.start()
>> ssc.awaitTermination()
>>
>> ssc.stop()
>>
>> The code for KafkaSink is as follows.
>>
>> class KafkaSink(createProducer: () => KafkaProducer[Array[Byte], 
>> Array[Byte]]) extends Serializable {
>>
>>   lazy val producer = createProducer()
>>   val logParser = new LogParser()
>>
>>   def send(topic: String, value: String): Unit = {
>>
>> val logLineBytes = 
>> Bytes.toBytes(logParser.avroEvent(value.split("\t")).toString)
>> producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, 
>> logLineBytes))
>>   }
>> }
>>
>> object KafkaSink {
>>   def apply(config: Properties): KafkaSink = {
>>
>> val f = () => {
>>   val producer = new KafkaProducer[Array[Byte], Array[Byte]](config, 
>> null, null)
>>
>>   sys.addShutdownHook {
>> producer.close()
>>   }
>>   producer
>> }
>>
>> new KafkaSink(f)
>>   }
>> }
>>
>> Disclaimer: it is based on the code inspired by
>> http://allegro.tech/spark-kafka-integration.html.
>>
>> The job just sits there I cannot see any Job Stages being created.
>> Something I want to mention - I am trying to read gzipped files from
>> HDFS
>> - could it be that Streaming context is not able to read *.gz files?

Re: Multilabel classification support

2015-09-11 Thread Alexis Gillain
Do you mean by running a model on every label ?
That's another solution of course.

If you mean LogisticRegression natively "supports" multilabel, can you
provide me some references? From what I see in the code, it uses LabeledPoint,
which has only one label.

2015-09-11 21:54 GMT+08:00 Yanbo Liang :

> LogisticRegression in MLlib(not ML) package supports both multiclass and 
> multilabel classification.
>
>
> 2015-09-11 16:21 GMT+08:00 Alexis Gillain :
>
>> You can try these packages for adaboost.mh :
>>
>> https://github.com/BaiGang/spark_multiboost (scala)
>> or
>> https://github.com/tizfa/sparkboost (java)
>>
>>
>> 2015-09-11 15:29 GMT+08:00 Yasemin Kaya :
>>
>>> Hi,
>>>
>>> I want to use MLlib for multilabel classification, but I found
>>> http://spark.apache.org/docs/latest/mllib-classification-regression.html,
>>> and it is not what I mean. Is there a way to do multilabel classification?
>>> Thanks a lot.
>>>
>>> Best,
>>> yasemin
>>>
>>> --
>>> hiç ender hiç
>>>
>>
>>
>>
>> --
>> Alexis GILLAIN
>>
>
>


-- 
Alexis GILLAIN


Re: countApproxDistinctByKey in python

2015-09-11 Thread Ted Yu
It has not been ported yet.

On Fri, Sep 11, 2015 at 4:13 PM, LucaMartinetti  wrote:

> Hi,
>
> I am trying to use countApproxDistinctByKey in pyspark but cannot find it.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L417
>
> Am I missing something, or has it not been ported/wrapped yet?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/countApproxDistinctByKey-in-python-tp24663.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Model summary for linear and logistic regression.

2015-09-11 Thread Sebastian Kuepers
Hey,


the 1.5.0 release notes say that there are now model summaries for logistic
regression available.

But I can't find them in the current documentation. Where are they?

Any help very much appreciated!

Thanks


Sebastian
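
If it is the new spark.ml API that exposes the summaries, usage would look roughly like this (a sketch only; the class and method names — summary, objectiveHistory, BinaryLogisticRegressionSummary — are to the best of my knowledge of 1.5 and worth verifying against the Scaladocs, and "training" is a placeholder DataFrame with "label" and "features" columns):

import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression}

val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(training)

val summary = model.summary                         // training summary attached to the fitted model
println(summary.objectiveHistory.mkString(", "))    // loss at each iteration

summary match {
  case b: BinaryLogisticRegressionSummary => println(s"area under ROC: ${b.areaUnderROC}")
  case _ =>
}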





Disclaimer The information in this email and any attachments may contain 
proprietary and confidential information that is intended for the addressee(s) 
only. If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution, retention or use of the contents of this 
information is prohibited. When addressed to our clients or vendors, any 
information contained in this e-mail or any attachments is subject to the terms 
and conditions in any governing contract. If you have received this e-mail in 
error, please immediately contact the sender and delete the e-mail.


Fwd: MLlib LDA implementation questions

2015-09-11 Thread Marko Asplund
Hi,

We're considering using Spark MLlib (v >= 1.5) LDA implementation for topic
modelling. We plan to train the model using a data set of about 12 M
documents and vocabulary size of 200-300 k items. Documents are relatively
short, typically containing less than 10 words, but the number can range up
to tens of words. The model would be updated periodically using e.g. a
batch process while predictions will be queried by a long-running
application process in which we plan to embed MLlib.

Is the MLlib LDA implementation considered to be well-suited to this kind
of use case?

I did some prototyping based on the code samples on "MLlib - Clustering"
page and noticed that the topics matrix values seem to vary quite a bit
across training runs even with the exact same input data set. During
prediction I observed similar behaviour.
Is this due to the probabilistic nature of the LDA algorithm?

Any caveats to be aware of with the LDA implementation?

For reference, my prototype code can be found here:
https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala


thanks,
marko
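
For reference, the kind of pipeline in question boils down to something like this (a minimal sketch; the tiny corpus is made up, and setSeed is worth noting because LDA is randomly initialized, which would explain run-to-run variation in the topics matrix):

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// documents as (docId, term-count vector) over a fixed vocabulary of 5 terms
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.sparse(5, Seq((0, 1.0), (3, 2.0)))),
  (1L, Vectors.sparse(5, Seq((1, 1.0), (4, 1.0))))
))

val ldaModel = new LDA()
  .setK(2)
  .setMaxIterations(50)
  .setSeed(42L)                    // fix the seed to make runs comparable
  .run(corpus)

println(ldaModel.topicsMatrix)     // vocabSize x k matrix of topic-term weights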


Re: Exception Handling : Spark Streaming

2015-09-11 Thread Ted Yu
Was your intention that exception from rdd.saveToCassandra() be caught ?
In that case you can place try / catch around that call.

Cheers
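
Something along these lines (a sketch based on the code below; what to do on failure is up to you):

wordCountPair.foreachRDD { rdd =>
  try {
    rdd.saveToCassandra("nexti", "direct_api_test", AllColumns)
  } catch {
    // catching here keeps the streaming application alive when a single batch fails to write
    case e: Exception => println(s"Failed to write batch to Cassandra: ${e.getMessage}")
  }
}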

On Fri, Sep 11, 2015 at 7:30 AM, Samya  wrote:

> Hi Team,
>
> I am facing an issue wherein I can't figure out why the exception is
> handled the first time an exception is thrown in the stream processing
> action, but is ignored the second time.
>
> PFB my code base.
>
>
>
> object Boot extends App {
>
>   //Load the configuration
>   val config = LoadConfig.getConfig
>
>   val checkpointDirectory =
> config.getString("nexti.spark.checkpointDirectory")
>   var ssc: StreamingContext = null
>   try{
> val sparkConf = new SparkConf()
>   .setAppName(Boot.config.getString("nexti.spark.appNme"))
>   .setMaster(Boot.config.getString("nexti.spark.master"))
>   .set("spark.cassandra.connection.host",
> Boot.config.getString("nexti.spark.cassandra.connection.host"))
>
>
> .set("spark.cassandra.query.retry.count",Boot.config.getString("nexti.spark.cassandra.query.retry.count"))
>
> val ssc = new StreamingContext(sparkConf, Seconds(30))
>
> val msgStream = SparkKafkaDirectAPI.getStream(ssc)
>
> val wordCountPair = msgStream.map(_._2).flatMap(_.split(" ")).map(x =>
> (x, 1L)).reduceByKey(_ + _)
>
> wordCountPair.foreachRDD(rdd =>
> rdd.saveToCassandra("nexti","direct_api_test",AllColumns))
>
> ssc.start()
> ssc.awaitTermination()
>   }
>   catch {
> case ex: Exception =>{
>   println(" Exception UNKNOWN Only.")
> }
>   }
> }
>
>
> I am sure that I am missing something; please provide your inputs.
>
> Regards,
> Sam
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Exception-Handling-Spark-Streaming-tp24658.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client
API so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on nodes containing query
results- which means you can only assume most results will be sent over the
network to the task that needs to process it. This is bad. The other reason
(which is also related to the issue of locality) is that I'm not sure if
there's an easy way to spread the results of a query over multiple
different clients- thus you'd probably have to start your Spark RDD with a
single partition and then repartition. What you've done at that point is
you've taken data from multiple mongodb nodes and you've collected them on
a single node just to re-partition them, again across the network, onto
multiple nodes. This is also bad.

I think this is the reason it was recommended to use MongoDB's mapreduce
because they can use their locality information internally. I had this same
issue w/ Couchbase a couple years back- it's unfortunate but it's the
reality.




On Fri, Sep 11, 2015 at 9:34 AM, Sandeep Giri 
wrote:

> I think it should be possible by loading collections as RDD and then doing
> a union on them.
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com. 
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <
> abhishek.mis...@xerox.com> wrote:
>
>> Anything using Spark RDD’s ???
>>
>>
>>
>> Abhishek
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
>> *Sent:* Friday, September 11, 2015 3:19 PM
>> *To:* Mishra, Abhishek; user@spark.apache.org; d...@spark.apache.org
>> *Subject:* Re: MongoDB and Spark
>>
>>
>>
>> use map-reduce.
>>
>>
>>
>> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
>> wrote:
>>
>> Hello ,
>>
>>
>>
>> Is there any way to query multiple collections from MongoDB using Spark
>> and Java?  And I want to create only one Configuration object. Please help
>> if anyone has something regarding this.
>>
>>
>>
>>
>>
>> Thank You
>>
>> Abhishek
>>
>>
>


Exception Handling : Spark Streaming

2015-09-11 Thread Samya
Hi Team, 

I am facing an issue wherein I can't figure out why the exception is
handled the first time an exception is thrown in the stream processing
action, but is ignored the second time.

PFB my code base.



object Boot extends App {

  //Load the configuration
  val config = LoadConfig.getConfig

  val checkpointDirectory =
config.getString("nexti.spark.checkpointDirectory")
  var ssc: StreamingContext = null
  try{
val sparkConf = new SparkConf()
  .setAppName(Boot.config.getString("nexti.spark.appNme"))
  .setMaster(Boot.config.getString("nexti.spark.master"))
  .set("spark.cassandra.connection.host",
Boot.config.getString("nexti.spark.cassandra.connection.host"))
 
.set("spark.cassandra.query.retry.count",Boot.config.getString("nexti.spark.cassandra.query.retry.count"))

val ssc = new StreamingContext(sparkConf, Seconds(30))

val msgStream = SparkKafkaDirectAPI.getStream(ssc)

val wordCountPair = msgStream.map(_._2).flatMap(_.split(" ")).map(x =>
(x, 1L)).reduceByKey(_ + _)

wordCountPair.foreachRDD(rdd =>
rdd.saveToCassandra("nexti","direct_api_test",AllColumns))

ssc.start()
ssc.awaitTermination()
  }
  catch {
case ex: Exception =>{
  println(" Exception UNKNOWN Only.")
}
  }
}


I am sure that I am missing something; please provide your inputs.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-Handling-Spark-Streaming-tp24658.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly,
is there a way to do it programmatically based on a configurable timeout
value?

Thanks.
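
One approach (a sketch; setJobGroup/cancelJobGroup are SparkContext methods, while the timer and the "slow-query" group id are just one way to wire up a configurable timeout):

import java.util.concurrent.{Executors, TimeUnit}

sc.setJobGroup("slow-query", "job that may lag", interruptOnCancel = true)

// cancel everything in the group if it is still running after the timeout
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.schedule(new Runnable {
  def run(): Unit = sc.cancelJobGroup("slow-query")
}, 10, TimeUnit.MINUTES)

val result = someRdd.count()   // someRdd is a placeholder for the actual work
scheduler.shutdown()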


Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Richard Hillegas

The latest Derby SQL Reference manual (version 10.11) can be found here:
https://db.apache.org/derby/docs/10.11/ref/index.html. It is, indeed, very
useful to have a comprehensive reference guide. The Derby build scripts can
also produce a BNF description of the grammar--but that is not part of the
public documentation for the project. The BNF is trivial to generate
because it is an artifact of the JavaCC grammar generator which Derby uses.

I appreciate the difficulty of maintaining a formal reference guide for a
rapidly evolving SQL dialect like Spark's.

A machine-generated BNF, however, is easy to imagine. But perhaps not so
easy to implement. Spark's SQL grammar is implemented in Scala, extending
the DSL support provided by the Scala language. I am new to programming in
Scala, so I don't know whether the Scala ecosystem provides any good tools
for reverse-engineering a BNF from a class which extends
scala.util.parsing.combinator.syntactical.StandardTokenParsers.

Thanks,
-Rick

vivekw...@gmail.com wrote on 09/11/2015 05:05:47 AM:

> From: vivek bhaskar 
> To: Ted Yu 
> Cc: user 
> Date: 09/11/2015 05:06 AM
> Subject: Re: Is there any Spark SQL reference manual?
> Sent by: vivekw...@gmail.com
>
> Hi Ted,
>
> The link you mention do not have complete list of supported syntax.
> For example, few supported syntax are listed as "Supported Hive
> features" but that do not claim to be exhaustive (even if it is so,
> one has to filter out a lot many lines from Hive QL reference and
> still will not be sure if its all - due to versions mismatch).
>
> Quickly searching online gives me link for another popular open
> source project which has good sql reference: https://db.apache.org/
> derby/docs/10.1/ref/crefsqlj23296.html.
>
> I had similar expectation when I was looking for all supported DDL
> and DML syntax along with their extensions. For example,
> a. Select expression along with supported extensions i.e. where
> clause, group by, different supported joins etc.
> b. SQL format for Create, Insert, Alter table etc.
> c. SQL for Insert, Update, Delete, etc along with their extensions.
> d. Syntax for view creation, if supported
> e. Syntax for explain mechanism
> f. List of supported functions, operators, etc. I can see that 100s
> of function are added in 1.5 but then you have to make lot of cross
> check from code to JIRA tickets.
>
> So I wanted a piece of documentation that can provide all such
> information at a single place.
>
> Regards,
> Vivek
>
> On Fri, Sep 11, 2015 at 4:29 PM, Ted Yu  wrote:
> You may have seen this:
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Please suggest what should be added.
>
> Cheers
>
> On Fri, Sep 11, 2015 at 3:43 AM, vivek bhaskar 
wrote:
> Hi all,
>
> I am looking for a reference manual for Spark SQL some thing like
> many database vendors have. I could find one for hive ql https://
> cwiki.apache.org/confluence/display/Hive/LanguageManual but not
> anything specific to spark sql.
>
> Please suggest. SQL reference specific to latest release will be of
> great help.
>
> Regards,
> Vivek

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Peyman Mohajerian
http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSqlSupportedSyntax.html

On Fri, Sep 11, 2015 at 8:15 AM, Richard Hillegas 
wrote:

> The latest Derby SQL Reference manual (version 10.11) can be found here:
> https://db.apache.org/derby/docs/10.11/ref/index.html. It is, indeed,
> very useful to have a comprehensive reference guide. The Derby build
> scripts can also produce a BNF description of the grammar--but that is not
> part of the public documentation for the project. The BNF is trivial to
> generate because it is an artifact of the JavaCC grammar generator which
> Derby uses.
>
> I appreciate the difficulty of maintaining a formal reference guide for a
> rapidly evolving SQL dialect like Spark's.
>
> A machine-generated BNF, however, is easy to imagine. But perhaps not so
> easy to implement. Spark's SQL grammar is implemented in Scala, extending
> the DSL support provided by the Scala language. I am new to programming in
> Scala, so I don't know whether the Scala ecosystem provides any good tools
> for reverse-engineering a BNF from a class which extends
> scala.util.parsing.combinator.syntactical.StandardTokenParsers.
>
> Thanks,
> -Rick
>
> vivekw...@gmail.com wrote on 09/11/2015 05:05:47 AM:
>
> > From: vivek bhaskar 
> > To: Ted Yu 
> > Cc: user 
> > Date: 09/11/2015 05:06 AM
> > Subject: Re: Is there any Spark SQL reference manual?
> > Sent by: vivekw...@gmail.com
>
> >
> > Hi Ted,
> >
> > The link you mention do not have complete list of supported syntax.
> > For example, few supported syntax are listed as "Supported Hive
> > features" but that do not claim to be exhaustive (even if it is so,
> > one has to filter out a lot many lines from Hive QL reference and
> > still will not be sure if its all - due to versions mismatch).
> >
> > Quickly searching online gives me link for another popular open
> > source project which has good sql reference: https://db.apache.org/
> > derby/docs/10.1/ref/crefsqlj23296.html.
> >
> > I had similar expectation when I was looking for all supported DDL
> > and DML syntax along with their extensions. For example,
> > a. Select expression along with supported extensions i.e. where
> > clause, group by, different supported joins etc.
> > b. SQL format for Create, Insert, Alter table etc.
> > c. SQL for Insert, Update, Delete, etc along with their extensions.
> > d. Syntax for view creation, if supported
> > e. Syntax for explain mechanism
> > f. List of supported functions, operators, etc. I can see that 100s
> > of function are added in 1.5 but then you have to make lot of cross
> > check from code to JIRA tickets.
> >
> > So I wanted a piece of documentation that can provide all such
> > information at a single place.
> >
> > Regards,
> > Vivek
> >
> > On Fri, Sep 11, 2015 at 4:29 PM, Ted Yu  wrote:
> > You may have seen this:
> > https://spark.apache.org/docs/latest/sql-programming-guide.html
> >
> > Please suggest what should be added.
> >
> > Cheers
> >
> > On Fri, Sep 11, 2015 at 3:43 AM, vivek bhaskar 
> wrote:
> > Hi all,
> >
> > I am looking for a reference manual for Spark SQL some thing like
> > many database vendors have. I could find one for hive ql https://
> > cwiki.apache.org/confluence/display/Hive/LanguageManual but not
> > anything specific to spark sql.
> >
> > Please suggest. SQL reference specific to latest release will be of
> > great help.
> >
> > Regards,
> > Vivek
>
>


Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Hi Raghavendra,

Thanks for your answers. I am passing 10 executors and I am not sure if
that is the problem. It is still hung.

Regards,
Atul.


On Fri, Sep 11, 2015 at 12:40 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> You can pass the number of executors via command line option
> --num-executors.You need more than 2 executors to make spark-streaming
> working.
>
> For more details on command line option, please go through
> http://spark.apache.org/docs/latest/running-on-yarn.html.
>
>
> On Fri, Sep 11, 2015 at 10:52 AM, Atul Kulkarni 
> wrote:
>
>> I am submitting the job with yarn-cluster mode.
>>
>> spark-submit --master yarn-cluster ...
>>
>> On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey <
>> raghavendra.pan...@gmail.com> wrote:
>>
>>> What is the value of spark master conf.. By default it is local, that
>>> means only one thread can run and that is why your job is stuck.
>>> Specify it local[*], to make thread pool equal to number of cores...
>>>
>>> Raghav
>>> On Sep 11, 2015 6:06 AM, "Atul Kulkarni" 
>>> wrote:
>>>
 Hi Folks,

 Below is the code I have for a Spark-based Kafka producer to take
 advantage of multiple executors reading files in parallel on my cluster, but
 I am stuck: the program is not making any progress.

 Below is my scrubbed code:

 val sparkConf = new SparkConf().setAppName(applicationName)
 val ssc = new StreamingContext(sparkConf, Seconds(2))

 val producerObj = ssc.sparkContext.broadcast(KafkaSink(kafkaProperties))

 val zipFileDStreams = ssc.textFileStream(inputFiles)
 zipFileDStreams.foreachRDD {
   rdd =>
 rdd.foreachPartition(
   partition => {
 partition.foreach{
   case (logLineText) =>
 println(logLineText)
 producerObj.value.send(topics, logLineText)
 }
   }
 )
 }

 ssc.start()
 ssc.awaitTermination()

 ssc.stop()

 The code for KafkaSink is as follows.

 class KafkaSink(createProducer: () => KafkaProducer[Array[Byte], 
 Array[Byte]]) extends Serializable {

   lazy val producer = createProducer()
   val logParser = new LogParser()

   def send(topic: String, value: String): Unit = {

 val logLineBytes = 
 Bytes.toBytes(logParser.avroEvent(value.split("\t")).toString)
 producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, 
 logLineBytes))
   }
 }

 object KafkaSink {
   def apply(config: Properties): KafkaSink = {

 val f = () => {
   val producer = new KafkaProducer[Array[Byte], Array[Byte]](config, 
 null, null)

   sys.addShutdownHook {
 producer.close()
   }
   producer
 }

 new KafkaSink(f)
   }
 }

 Disclaimer: it is based on the code inspired by
 http://allegro.tech/spark-kafka-integration.html.

 The job just sits there I cannot see any Job Stages being created.
 Something I want to mention - I am trying to read gzipped files from HDFS
 - could it be that Streaming context is not able to read *.gz files?


 I am not sure what more details I can provide to help explain my
 problem.


 --
 Regards,
 Atul Kulkarni

>>>
>>
>>
>> --
>> Regards,
>> Atul Kulkarni
>>
>
>


-- 
Regards,
Atul Kulkarni


Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Ted Yu
Very nice suggestion, Richard.

I logged SPARK-10561 referencing this discussion.

On Fri, Sep 11, 2015 at 8:15 AM, Richard Hillegas 
wrote:

> The latest Derby SQL Reference manual (version 10.11) can be found here:
> https://db.apache.org/derby/docs/10.11/ref/index.html. It is, indeed,
> very useful to have a comprehensive reference guide. The Derby build
> scripts can also produce a BNF description of the grammar--but that is not
> part of the public documentation for the project. The BNF is trivial to
> generate because it is an artifact of the JavaCC grammar generator which
> Derby uses.
>
> I appreciate the difficulty of maintaining a formal reference guide for a
> rapidly evolving SQL dialect like Spark's.
>
> A machine-generated BNF, however, is easy to imagine. But perhaps not so
> easy to implement. Spark's SQL grammar is implemented in Scala, extending
> the DSL support provided by the Scala language. I am new to programming in
> Scala, so I don't know whether the Scala ecosystem provides any good tools
> for reverse-engineering a BNF from a class which extends
> scala.util.parsing.combinator.syntactical.StandardTokenParsers.
>
> Thanks,
> -Rick
>
> vivekw...@gmail.com wrote on 09/11/2015 05:05:47 AM:
>
> > From: vivek bhaskar 
> > To: Ted Yu 
> > Cc: user 
> > Date: 09/11/2015 05:06 AM
> > Subject: Re: Is there any Spark SQL reference manual?
> > Sent by: vivekw...@gmail.com
>
> >
> > Hi Ted,
> >
> > The link you mention do not have complete list of supported syntax.
> > For example, few supported syntax are listed as "Supported Hive
> > features" but that do not claim to be exhaustive (even if it is so,
> > one has to filter out a lot many lines from Hive QL reference and
> > still will not be sure if its all - due to versions mismatch).
> >
> > Quickly searching online gives me link for another popular open
> > source project which has good sql reference: https://db.apache.org/
> > derby/docs/10.1/ref/crefsqlj23296.html.
> >
> > I had similar expectation when I was looking for all supported DDL
> > and DML syntax along with their extensions. For example,
> > a. Select expression along with supported extensions i.e. where
> > clause, group by, different supported joins etc.
> > b. SQL format for Create, Insert, Alter table etc.
> > c. SQL for Insert, Update, Delete, etc along with their extensions.
> > d. Syntax for view creation, if supported
> > e. Syntax for explain mechanism
> > f. List of supported functions, operators, etc. I can see that 100s
> > of function are added in 1.5 but then you have to make lot of cross
> > check from code to JIRA tickets.
> >
> > So I wanted a piece of documentation that can provide all such
> > information at a single place.
> >
> > Regards,
> > Vivek
> >
> > On Fri, Sep 11, 2015 at 4:29 PM, Ted Yu  wrote:
> > You may have seen this:
> > https://spark.apache.org/docs/latest/sql-programming-guide.html
> >
> > Please suggest what should be added.
> >
> > Cheers
> >
> > On Fri, Sep 11, 2015 at 3:43 AM, vivek bhaskar 
> wrote:
> > Hi all,
> >
> > I am looking for a reference manual for Spark SQL some thing like
> > many database vendors have. I could find one for hive ql https://
> > cwiki.apache.org/confluence/display/Hive/LanguageManual but not
> > anything specific to spark sql.
> >
> > Please suggest. SQL reference specific to latest release will be of
> > great help.
> >
> > Regards,
> > Vivek
>
>


Realtime Data Visualization Tool for Spark

2015-09-11 Thread Shashi Vishwakarma
Hi

I have streaming data which needs to be processed and sent for
visualization. I am planning to use Spark Streaming for this, but I am a little
bit confused about choosing a visualization tool. I read somewhere that D3.js
can be used, but I wanted to know which is the best tool for visualization when
dealing with a streaming application (something that can be easily integrated).

If someone has any link about integrating D3.js (or any other
visualization tool) with a Spark Streaming application, then
please share. That would be a great help.


Thanks and Regards
Shashi


RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen

  Thanks Hao!

  I tried your suggestion of setting spark.shuffle.reduceLocality.enabled
=false and my initial tests showed queries are on par between 1.5 and
1.4.1.

  Results:

tpcds-query39b-141.out:query time: 129.106478631 sec
tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
tpcds-query39b-150.out:query time: 572.443151734 sec

With the default spark.shuffle.reduceLocality.enabled=true, I am seeing an
across-the-board slowdown for the majority of the TPCDS queries.

My test is on a bare-metal 20-node cluster. I ran my test as follows:

/TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client
--packages com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
--conf spark.shuffle.reduceLocality.enabled=false
--executor-memory 4096m --num-executors 100
--class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
/TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar

hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
/TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql






From:   "Cheng, Hao" 
To: Todd 
Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust
, "user@spark.apache.org"

Date:   09/11/2015 01:00 AM
Subject:RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by
50%+ compared with spark 1.4.1 SQL



Can you confirm whether the query really runs in cluster mode, not local
mode? Can you print the call stack of the executor while the query is
running?

BTW: spark.shuffle.reduceLocality.enabled is the configuration of Spark,
not Spark SQL.

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 3:39 PM
To: Todd
Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+
compared with spark 1.4.1 SQL

I add the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as not setting them two.

One thing is that on the Spark UI, when I click the SQL tab, it shows an
empty page with only the header title 'SQL'; there is no table showing queries and
execution plan information.




At 2015-09-11 14:39:06, "Todd"  wrote:


 Thanks Hao.
  Yes, it is still as slow as with SMJ. Let me try the option you suggested.


 At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:

  You mean the performance is still as slow as with the SMJ in Spark 1.5?

  Can you set spark.shuffle.reduceLocality.enabled=false when you start
  the spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true
  by default, but we found it can cause performance to drop
  dramatically.


  From: Todd [mailto:bit1...@163.com]
  Sent: Friday, September 11, 2015 2:17 PM
  To: Cheng, Hao
  Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
  Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared
  with spark 1.4.1 SQL

  Thanks Hao for the reply.
  I turn the merge sort join off, the physical plan is below, but the
  performance is roughly the same as it on...

  == Physical Plan ==
  TungstenProject
  
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]

   ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
TungstenExchange hashpartitioning(ss_item_sk#2)
 ConvertToUnsafe
  Scan ParquetRelation
  
[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]

TungstenExchange hashpartitioning(ss_item_sk#25)
 ConvertToUnsafe
  Scan ParquetRelation
  
[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]


  Code Generation: true



  At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:
  This is not a big surprise the SMJ is slower than the HashJoin, as we do
  not fully utilize the sorting yet, more details can be found at
  https://issues.apache.org/jira/browse/SPARK-2926 .

  Anyway, can you disable the sort merge join by
  “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query
  again? In our previous testing, it’s about 20% slower for sort merge
  join. I am not sure if there anything else slow down the performance.

  Hao


  From: Jesse F Chen [mailto:jfc...@us.ibm.com]
  Sent: Friday, September 11, 2015 1:18 PM
  To: Michael Armbrust
  Cc: Todd; user@spark.apache.org
  Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with
  spark 1.4.1 SQL



  Could this be a build issue (i.e., sbt package)?

  If I ran the same jar build for 1.4.1 in 1.5, I am seeing large
  regression too in queries (all other things identical)...

  I am curious, to build 1.5 (when it isn't released yet),