Re: spark-ec2 vs. EMR

2015-12-01 Thread Alexander Pivovarov
1. EMR 4.2.0 has Zeppelin as an alternative to Databricks notebooks.

2. EMR includes Ganglia 3.6.0.

3. EMR ships Hadoop fs settings that make S3 fast (the direct EmrFileSystem).

4. EMR has the S3 keys in its Hadoop configs.

5. EMR allows you to resize the cluster on the fly.

6. EMR has the AWS SDK on the Spark classpath, which helps reduce the
application assembly jar size.

7. The ec2 script installs everything under /root; EMR has dedicated users
(hadoop, zeppelin, etc.). In that respect EMR is similar to Cloudera or Hortonworks.

8. There are at least 3 spark-ec2 projects (in apache/spark, in mesos, in
amplab). The master branch in Spark has an outdated ec2 script, and the other
projects have broken links in their READMEs. What a mess!

9. The ec2 script has poor documentation and uninformative error messages.
For example, the README says nothing about the --private-ips option; if you do
not add the flag, it connects to an empty-string host (localhost) instead of
the master. This was fixed only last week, and I am not sure the fix is in all branches.

10. I think Amazon will include spark-jobserver in EMR soon.

11. You do not need to be an AWS expert to start an EMR cluster. Users can use
the EMR web UI to start a cluster to run some jobs or work in Zeppelin during
the day.

12. An EMR cluster starts in about 8 minutes. The ec2 script takes longer, and
you need to stay online while it runs.
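
Regarding points 3 and 4, on a spark-ec2 cluster you typically have to wire the S3
credentials into the Hadoop configuration yourself, something EMR does for you. A
rough sketch of that manual wiring (the fs.s3n.* keys are the standard Hadoop ones
for this era of Spark; the environment variables and bucket path are placeholders):

// Assumes AWS credentials are exported in the driver's environment.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val logs = sc.textFile("s3n://some-bucket/logs/*")   // placeholder bucket and path
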
On Dec 1, 2015 9:22 AM, "Jerry Lam"  wrote:

> Simply put:
>
> EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR API
> + Selected Instance Types + Amazon EC2 Friendly (bootstrapping)
> spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any
> Instance Type
>
> I use spark-ec2 for prototyping and I have never used it for production.
>
> just my $0.02
>
>
>
> On Dec 1, 2015, at 11:15 AM, Nick Chammas 
> wrote:
>
> Pinging this thread in case anyone has thoughts on the matter they want to
> share.
>
> On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <[hidden email]> wrote:
>
>> Spark has come bundled with spark-ec2 for many years.
>> At the same time, EMR has been capable of running Spark for a while, and
>> earlier this year it added "official" support.
>>
>> If you're looking for a way to provision Spark clusters, there are some
>> clear differences between these 2 options. I think the biggest one would be
>> that EMR is a "production" solution backed by a company, whereas spark-ec2
>> is not really intended for production use (as far as I know).
>>
>> That particular difference in intended use may or may not matter to you,
>> but I'm curious:
>>
>> What are some of the other differences between the 2 that do matter to
>> you? If you were considering these 2 solutions for your use case at one
>> point recently, why did you choose one over the other?
>>
>> I'd be especially interested in hearing about why people might choose
>> spark-ec2 over EMR, since the latter option seems to have shaped up nicely
>> this year.
>>
>> Nick
>>
>>
> --
> View this message in context: Re: spark-ec2 vs. EMR
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Yes.
The use case would be: have Spark running as a service (I haven't investigated
this yet); through API calls to this service we perform some aggregations over
data in SQL. We are already doing this with an internal development.

Nothing complicated. For instance, a table with Product, Product Family,
cost, price, etc., where the columns act as dimensions and measures.

I want to use Spark to query that table and perform a kind of rollup, with cost
as the measure and Product and Product Family as the dimensions.

With only 3 columns, it takes about 20 s to run that query and the aggregation,
while the same query run directly against the database with a GROUP BY on those
columns takes about 1 s.

regards
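
A minimal sketch of that rollup in Spark SQL, assuming the table is loaded once
through the JDBC data source and cached (DataFrame.rollup exists from Spark 1.4
onward; the connection URL, table and column names are placeholders):

import org.apache.spark.sql.functions.sum

val products = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost/sales")   // placeholder JDBC URL; driver jar must be on the classpath
  .option("dbtable", "products")                // placeholder table
  .load()
  .cache()                                      // keep it in memory so repeated aggregations skip the RDBMS round-trip

val cube = products
  .rollup("Product", "ProductFamily")
  .agg(sum("cost").as("total_cost"))

cube.show()

Much of the 20 s is likely context start-up plus the initial JDBC scan; once the
DataFrame is cached, subsequent rollups should be much closer to the database's
GROUP BY time.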



On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke  wrote:

> can you elaborate more on the use case?
>
> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
> >
> > Hi,
> >
> > I'd like to use spark to perform some transformations over data stored
> inSQL, but I need low Latency, I'm doing some test and I run into spark
> context creation and data query over SQL takes too long time.
> >
> > Any idea for speed up the process?
> >
> > regards.
> >
> > --
> > Ing. Ivaldi Andres
>



-- 
Ing. Ivaldi Andres


Driver Hangs before starting Job

2015-12-01 Thread Patrick Brown
Hi,

I am building a Spark app which aggregates sensor data stored in Cassandra.
After I submit my app to Spark, the driver and application show up quickly;
then, before any Spark job shows up in the application UI, there is a huge
lag, on the order of minutes to sometimes hours. Once the Spark job itself
gets added, processing is very fast (milliseconds to possibly a few seconds).

While this hang is occurring there is no high load on my driver (CPU etc., seen
through top) or on my Cassandra cluster (disk/network etc., seen through top and
DataStax OpsCenter). Once the job does start, you can see the load spike.

I am accessing Cassandra using the DataStax Spark Cassandra connector. I do
have a lot of Cassandra partitions accessed in each job (7 days * 24 hours
* 50+ sensors) and I can see these Cassandra partitions in the DAG graph. I
have tried coalesce, which seems to help somewhat, but the lag is still
orders of magnitude larger than any processing time.

Thanks,

Patrick
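
For reference, a sketch of the read-plus-coalesce pattern described above, using
the DataStax connector's RDD API (keyspace, table, predicate and partition count
are placeholders, not taken from the original job):

import com.datastax.spark.connector._

val readings = sc.cassandraTable("sensors", "readings")   // placeholder keyspace/table
  .where("day >= ?", "2015-11-24")                        // placeholder pushdown predicate
  .coalesce(64)                                           // collapse the many small token-range partitions

val totals = readings
  .map(row => (row.getString("sensor_id"), row.getDouble("value")))
  .reduceByKey(_ + _)
totals.count()   // forces the read, so any pre-job lag shows up here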


Spark DIMSUM Memory requirement?

2015-12-01 Thread Parin Choganwala
Hi All,

I am trying to run RowMatrix.similarity(0.5) on 60K users (n) with 130K
features (m) on Spark 1.3.0.
I am using 4 m3.2xlarge nodes (30 GB RAM and 8 cores each) but getting lots of
ERROR YarnScheduler: Lost executor 1 on XXX.internal: remote Akka client disassociated

What could be the reason?
Is it shuffle memory that I should increase?

Thank You
Parin Choganwala
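
For reference, the DIMSUM entry point in MLlib is RowMatrix.columnSimilarities(threshold).
A minimal sketch with tiny placeholder vectors (the real input would be an RDD with one
sparse 130K-feature row per user); executor memory is normally raised via
spark.executor.memory and, on YARN, spark.yarn.executor.memoryOverhead, which is often
behind lost executors:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(                        // placeholder data
  Vectors.sparse(130000, Seq((0, 1.0), (42, 3.0))),
  Vectors.sparse(130000, Seq((7, 2.0), (42, 1.0)))))

val mat = new RowMatrix(rows)
val similarities = mat.columnSimilarities(0.5)        // DIMSUM sampling with threshold 0.5
println(similarities.entries.count())                 // CoordinateMatrix of column-pair similarities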



Re: Getting all files of a table

2015-12-01 Thread Krzysztof Zarzycki
Great, that worked! The only problem was that it returned all the files,
including _SUCCESS and _metadata, but I filtered down to just the *.parquet files.

Thanks Michael,
Krzysztof
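
A sketch of that filtering step combined with the ProtoParquetRDD constructor from
the quoted message below. Building one RDD per file and unioning them is an
assumption, since the thread doesn't say whether ProtoParquetRDD accepts a glob or a
comma-separated path list; ProtoParquetRDD and MyProto come from the sparksql-protobuf
library referenced below:

val parquetFiles = sqlContext.table("my_table").inputFiles   // placeholder table name
  .filter(_.endsWith(".parquet"))                            // drop _SUCCESS, _metadata, ...

val protobufsRdd = sc.union(
  parquetFiles.map(f => new ProtoParquetRDD(sc, f, classOf[MyProto])).toSeq)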


2015-12-01 20:20 GMT+01:00 Michael Armbrust :

> sqlContext.table("...").inputFiles
>
> (this is best effort, but should work for hive tables).
>
> Michael
>
> On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki 
> wrote:
>
>> Hi there,
>> Do you know how easily I can get a list of all files of a Hive table?
>>
>> What I want to achieve is to get all files that are underneath parquet
>> table and using sparksql-protobuf[1] library(really handy library!) and its
>> helper class ProtoParquetRDD:
>>
>> val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])
>>
>> Access the underlying parquet files as normal protocol buffers. But I
>> don't know how to get them. I pointed the call above to one file by hand it
>> worked well.
>> The parquet table was created based on the same library and it's implicit
>> hiveContext extension createDataFrame, which creates a DataFrame based on
>> Protocol buffer class.
>>
>> (The revert read operation is needed to support legacy code, where after
>> converting protocol buffers to parquet, I still want some code to access
>> parquet as normal protocol buffers).
>>
>> Maybe someone will have other way to get an Rdd of protocol buffers from
>> Parquet-stored table.
>>
>> [1] https://github.com/saurfang/sparksql-protobuf
>>
>> Thanks!
>> Krzysztof
>>
>>
>>
>>
>


Re: Low Latency SQL query

2015-12-01 Thread Jörn Franke
can you elaborate more on the use case?

> On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
> 
> Hi, 
> 
> I'd like to use spark to perform some transformations over data stored inSQL, 
> but I need low Latency, I'm doing some test and I run into spark context 
> creation and data query over SQL takes too long time. 
> 
> Any idea for speed up the process? 
> 
> regards. 
> 
> -- 
> Ing. Ivaldi Andres

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Hi Cody,

How should we look at Option 2 (see below)? Which portion of the code in
Spark Kafka Direct should we look at to handle this issue according to our
specific requirements?


2.Catch that exception and somehow force things to "reset" for that
partition And how would it handle the offsets already calculated in the
backlog (if there is one)?
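
For option 3 in the quoted list (tracking offsets yourself and restarting from them),
Spark 1.x has a createDirectStream overload that takes explicit starting offsets. A
hedged sketch, where ssc and kafkaParams are the existing StreamingContext and Kafka
configuration, and the topic/partition/offset values are placeholders:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val fromOffsets = Map(TopicAndPartition("test_stream", 5) -> 12345L)  // loaded from your own offset store

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))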

On Tue, Dec 1, 2015 at 6:51 AM, Cody Koeninger  wrote:

> If you're consistently getting offset out of range exceptions, it's
> probably because messages are getting deleted before you've processed them.
>
> The only real way to deal with this is give kafka more retention, consume
> faster, or both.
>
> If you're just looking for a quick "fix" for an infrequent issue, option 4
> is probably easiest.  I wouldn't do that automatically / silently, because
> you're losing data.
>
> On Mon, Nov 30, 2015 at 6:22 PM, SRK  wrote:
>
>> Hi,
>>
>> So, our Streaming Job fails with the following errors. If you see the
>> errors
>> below, they are all related to Kafka losing offsets and
>> OffsetOutOfRangeException.
>>
>> What are the options we have other than fixing Kafka? We would like to do
>> something like the following. How can we achieve 1 and 2 with Spark Kafka
>> Direct?
>>
>> 1.Need to see a way to skip some offsets if they are not available after
>> the
>> max retries are reached..in that case there might be data loss.
>>
>> 2.Catch that exception and somehow force things to "reset" for that
>> partition And how would it handle the offsets already calculated in the
>> backlog (if there is one)?
>>
>> 3.Track the offsets separately, restart the job by providing the offsets.
>>
>> 4.Or a straightforward approach would be to monitor the log for this
>> error,
>> and if it occurs more than X times, kill the job, remove the checkpoint
>> directory, and restart.
>>
>> ERROR DirectKafkaInputDStream: ArrayBuffer(kafka.common.UnknownException,
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([test_stream,5]))
>>
>>
>>
>> java.lang.ClassNotFoundException:
>> kafka.common.NotLeaderForPartitionException
>>
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>
>>
>>
>> java.util.concurrent.RejectedExecutionException: Task
>> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
>> rejected from java.util.concurrent.ThreadPoolExecutor@543258e0
>> [Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
>> 12112]
>>
>>
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 10
>> in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in stage
>> 52.0 (TID 255, 172.16.97.97): UnknownReason
>>
>> Exception in thread "streaming-job-executor-0" java.lang.Error:
>> java.lang.InterruptedException
>>
>> Caused by: java.lang.InterruptedException
>>
>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>
>>
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 7
>> in
>> stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage
>> 33.0
>> (TID 283, 172.16.97.103): UnknownReason
>>
>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>
>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Recovery-for-Spark-Streaming-Kafka-Direct-in-case-of-issues-with-Kafka-tp25524.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Low Latency SQL query

2015-12-01 Thread Josh Rosen
Use a long-lived SparkContext rather than creating a new one for each query.
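
A minimal sketch of that pattern for a long-running service (the object and app names
are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkBackend {
  // Built once when the service starts and shared by every request handler.
  lazy val sc = new SparkContext(new SparkConf().setAppName("query-service"))
  lazy val sqlContext = new SQLContext(sc)
}

// Per request, reuse the same context instead of paying the start-up cost each time:
// val result = SparkBackend.sqlContext.sql("SELECT ...").collect()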

On Tue, Dec 1, 2015 at 11:52 AM Andrés Ivaldi  wrote:

> Hi,
>
> I'd like to use spark to perform some transformations over data stored
> inSQL, but I need low Latency, I'm doing some test and I run into spark
> context creation and data query over SQL takes too long time.
>
> Any idea for speed up the process?
>
> regards.
>
> --
> Ing. Ivaldi Andres
>


Graph testing question

2015-12-01 Thread Nathan Kronenfeld
I'm trying to test some graph operations I've written using GraphX.

To make sure I catch all appropriate test cases, I'm trying to specify an
input graph that is partitioned a specific way.

Unfortunately, it seems graphx.Graph repartitions and shuffles any input
node and edge RDD I give it.

Is there a way to construct a graph so that it uses the partitions given
and doesn't shuffle everything around?

Thanks,
   -Nathan Kronenfeld
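
One way to write the test assertion is to inspect where the edges actually ended up
after construction; a diagnostic sketch, assuming graph is the constructed graphx.Graph:

// Tag every edge with the id of the partition it landed in.
val placement = graph.edges
  .mapPartitionsWithIndex { (pid, iter) => iter.map(e => (pid, (e.srcId, e.dstId))) }
  .collect()

placement.groupBy(_._1).foreach { case (pid, edges) =>
  println(s"partition $pid -> ${edges.length} edges")
}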


Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Iulian Dragoș
As I mentioned on the Akka mailing list, in case others are following
this thread: the issue isn't with dependencies. It's a bug in the
maven-shade-plugin. It breaks classfiles when creating the assembly jar (it
seems to do some constant propagation). `sbt assembly` doesn't suffer from
this issue, probably because it uses another library for jar merging.

iulian

On Tue, Dec 1, 2015 at 7:21 PM, Boavida, Rodrigo  wrote:

> HI Jacek,
>
> Yes I was told that as well but no one gave me release schedules, and I
> have the immediate need to have Spark Applications communicating with Akka
> clusters based on latest version. I'm aware there is an ongoing effort to
> change to the low level netty implementation but AFAIK it's not available
> yet.
>
> Any suggestions are very welcomed.
>
> Tnks,
> Rod
>
> -Original Message-
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: 01 December 2015 18:17
> To: Boavida, Rodrigo 
> Cc: user 
> Subject: Re: Scala 2.11 and Akka 2.4.0
>
> On Tue, Dec 1, 2015 at 2:32 PM, RodrigoB 
> wrote:
>
> > I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.
>
> Why? AFAIK Spark's leaving Akka's boat and joins Netty's.
>
> Jacek
> This email (including any attachments) is proprietary to Aspect Software,
> Inc. and may contain information that is confidential. If you have received
> this message in error, please do not read, copy or forward this message.
> Please notify the sender immediately, delete it from your system and
> destroy any copies. You may not further disclose or distribute this email
> or its attachments.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Hi,

I have a doubt regarding yarn-cluster mode and
spark.driver.allowMultipleContexts for the use case below.

I have a long-running backend server where I create a short-lived Spark job in
response to each user request. Based on the fact that by default multiple
SparkContexts cannot be created in the same JVM, it looks like I have 2 choices:

1) enable spark.driver.allowMultipleContexts

2) run my jobs in yarn-cluster mode instead of yarn-client

For 1), I cannot find any official documentation, so it looks like it's not
encouraged, is it?
For 2), I want to make sure yarn-cluster will NOT hit that limitation (a single
SparkContext per JVM); apparently I would have to do something on the driver
side to push the result set back to my application.

Thanks

-- 
--Anfernee


Re: Low Latency SQL query

2015-12-01 Thread Jörn Franke
Hmm, it will never be faster than the database if you use SQL as the underlying
storage. Spark is (currently) an in-memory batch engine for iterative
machine-learning workloads; it is not designed for interactive queries.
Currently Hive is moving in the direction of interactive queries. Alternatives
are HBase with Phoenix, or Impala.

> On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
> 
> Yes, 
> The use case would be,
> Have spark in a service (I didnt invertigate this yet), through api calls of 
> this service we perform some aggregations over data in SQL, We are already 
> doing this with an internal development
> 
> Nothing complicated, for instance, a table with Product, Product Family, 
> cost, price, etc. Columns like Dimension and Measures,
> 
> I want to Spark for query that table to perform a kind of rollup, with cost 
> as Measure and Prodcut, Product Family as Dimension
> 
> Only 3 columns, it takes like 20s to perform that query and the aggregation, 
> the  query directly to the database with a grouping at the columns takes like 
> 1s 
> 
> regards
> 
> 
> 
>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke  wrote:
>> can you elaborate more on the use case?
>> 
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use spark to perform some transformations over data stored 
>> > inSQL, but I need low Latency, I'm doing some test and I run into spark 
>> > context creation and data query over SQL takes too long time.
>> >
>> > Any idea for speed up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
> 
> 
> 
> -- 
> Ing. Ivaldi Andres


Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
How can we avoid those errors with the receiver-based approach? Suppose we are OK
with at-least-once processing and use the receiver-based approach, which uses
ZooKeeper rather than querying Kafka directly; would these errors (Couldn't find
leader offsets for Set([test_stream,5])) be avoided?
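
For reference, a sketch of the receiver-based stream being asked about; with this API
the Kafka high-level consumer tracks offsets in ZooKeeper, so the driver does not query
the brokers for leader offsets itself (the quorum, group id and thread count below are
placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils.createStream(
  ssc,                                  // the existing StreamingContext
  "zk1:2181,zk2:2181,zk3:2181",         // placeholder ZooKeeper quorum
  "my-consumer-group",                  // placeholder consumer group id
  Map("test_stream" -> 2))              // topic -> number of receiver threads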

On Tue, Dec 1, 2015 at 3:40 PM, Cody Koeninger  wrote:

> KafkaRDD.scala , handleFetchErr
>
> On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> How to look at Option 2(see the following)? Which portion of the code in
>> Spark Kafka Direct to look at to handle this issue specific to our
>> requirements.
>>
>>
>> 2.Catch that exception and somehow force things to "reset" for that
>> partition And how would it handle the offsets already calculated in the
>> backlog (if there is one)?
>>
>> On Tue, Dec 1, 2015 at 6:51 AM, Cody Koeninger 
>> wrote:
>>
>>> If you're consistently getting offset out of range exceptions, it's
>>> probably because messages are getting deleted before you've processed them.
>>>
>>> The only real way to deal with this is give kafka more retention,
>>> consume faster, or both.
>>>
>>> If you're just looking for a quick "fix" for an infrequent issue, option
>>> 4 is probably easiest.  I wouldn't do that automatically / silently,
>>> because you're losing data.
>>>
>>> On Mon, Nov 30, 2015 at 6:22 PM, SRK  wrote:
>>>
 Hi,

 So, our Streaming Job fails with the following errors. If you see the
 errors
 below, they are all related to Kafka losing offsets and
 OffsetOutOfRangeException.

 What are the options we have other than fixing Kafka? We would like to
 do
 something like the following. How can we achieve 1 and 2 with Spark
 Kafka
 Direct?

 1.Need to see a way to skip some offsets if they are not available
 after the
 max retries are reached..in that case there might be data loss.

 2.Catch that exception and somehow force things to "reset" for that
 partition And how would it handle the offsets already calculated in the
 backlog (if there is one)?

 3.Track the offsets separately, restart the job by providing the
 offsets.

 4.Or a straightforward approach would be to monitor the log for this
 error,
 and if it occurs more than X times, kill the job, remove the checkpoint
 directory, and restart.

 ERROR DirectKafkaInputDStream:
 ArrayBuffer(kafka.common.UnknownException,
 org.apache.spark.SparkException: Couldn't find leader offsets for
 Set([test_stream,5]))



 java.lang.ClassNotFoundException:
 kafka.common.NotLeaderForPartitionException

 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)



 java.util.concurrent.RejectedExecutionException: Task

 org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
 rejected from java.util.concurrent.ThreadPoolExecutor@543258e0
 [Terminated,
 pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
 12112]



 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 10
 in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in
 stage
 52.0 (TID 255, 172.16.97.97): UnknownReason

 Exception in thread "streaming-job-executor-0" java.lang.Error:
 java.lang.InterruptedException

 Caused by: java.lang.InterruptedException

 java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException

 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)



 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 7 in
 stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage
 33.0
 (TID 283, 172.16.97.103): UnknownReason

 java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException

 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)

 java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException

 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Recovery-for-Spark-Streaming-Kafka-Direct-in-case-of-issues-with-Kafka-tp25524.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Thanks Jörn, I didn't expect Spark to be faster than SQL, just not that
slow.

We are tempted to use Spark as our hub of sources; that way we can access
different data sources through it and normalize them. Currently we are saving
the data in SQL because of Spark's latency, but the best option would be to
execute directly over Spark.


On Tue, Dec 1, 2015 at 9:05 PM, Jörn Franke  wrote:

> Hmm it will never be faster than SQL if you use SQL as an underlying
> storage. Spark is (currently) an in-memory batch engine for iterative
> machine learning workloads. It is not designed for interactive queries.
> Currently hive is going into the direction of interactive queries.
> Alternatives are Hbase on Phoenix or Impala.
>
> On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
>
> Yes,
> The use case would be,
> Have spark in a service (I didnt invertigate this yet), through api calls
> of this service we perform some aggregations over data in SQL, We are
> already doing this with an internal development
>
> Nothing complicated, for instance, a table with Product, Product Family,
> cost, price, etc. Columns like Dimension and Measures,
>
> I want to Spark for query that table to perform a kind of rollup, with
> cost as Measure and Prodcut, Product Family as Dimension
>
> Only 3 columns, it takes like 20s to perform that query and the
> aggregation, the  query directly to the database with a grouping at the
> columns takes like 1s
>
> regards
>
>
>
> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke  wrote:
>
>> can you elaborate more on the use case?
>>
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use spark to perform some transformations over data stored
>> inSQL, but I need low Latency, I'm doing some test and I run into spark
>> context creation and data query over SQL takes too long time.
>> >
>> > Any idea for speed up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>>
>
>
>
> --
> Ing. Ivaldi Andres
>
>


-- 
Ing. Ivaldi Andres


Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Cody Koeninger
KafkaRDD.scala , handleFetchErr

On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy 
wrote:

> Hi Cody,
>
> How to look at Option 2(see the following)? Which portion of the code in
> Spark Kafka Direct to look at to handle this issue specific to our
> requirements.
>
>
> 2.Catch that exception and somehow force things to "reset" for that
> partition And how would it handle the offsets already calculated in the
> backlog (if there is one)?
>
> On Tue, Dec 1, 2015 at 6:51 AM, Cody Koeninger  wrote:
>
>> If you're consistently getting offset out of range exceptions, it's
>> probably because messages are getting deleted before you've processed them.
>>
>> The only real way to deal with this is give kafka more retention, consume
>> faster, or both.
>>
>> If you're just looking for a quick "fix" for an infrequent issue, option
>> 4 is probably easiest.  I wouldn't do that automatically / silently,
>> because you're losing data.
>>
>> On Mon, Nov 30, 2015 at 6:22 PM, SRK  wrote:
>>
>>> Hi,
>>>
>>> So, our Streaming Job fails with the following errors. If you see the
>>> errors
>>> below, they are all related to Kafka losing offsets and
>>> OffsetOutOfRangeException.
>>>
>>> What are the options we have other than fixing Kafka? We would like to do
>>> something like the following. How can we achieve 1 and 2 with Spark Kafka
>>> Direct?
>>>
>>> 1.Need to see a way to skip some offsets if they are not available after
>>> the
>>> max retries are reached..in that case there might be data loss.
>>>
>>> 2.Catch that exception and somehow force things to "reset" for that
>>> partition And how would it handle the offsets already calculated in the
>>> backlog (if there is one)?
>>>
>>> 3.Track the offsets separately, restart the job by providing the offsets.
>>>
>>> 4.Or a straightforward approach would be to monitor the log for this
>>> error,
>>> and if it occurs more than X times, kill the job, remove the checkpoint
>>> directory, and restart.
>>>
>>> ERROR DirectKafkaInputDStream: ArrayBuffer(kafka.common.UnknownException,
>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>> Set([test_stream,5]))
>>>
>>>
>>>
>>> java.lang.ClassNotFoundException:
>>> kafka.common.NotLeaderForPartitionException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>>
>>>
>>> java.util.concurrent.RejectedExecutionException: Task
>>>
>>> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
>>> rejected from java.util.concurrent.ThreadPoolExecutor@543258e0
>>> [Terminated,
>>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
>>> 12112]
>>>
>>>
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 10
>>> in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in
>>> stage
>>> 52.0 (TID 255, 172.16.97.97): UnknownReason
>>>
>>> Exception in thread "streaming-job-executor-0" java.lang.Error:
>>> java.lang.InterruptedException
>>>
>>> Caused by: java.lang.InterruptedException
>>>
>>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>>
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 7 in
>>> stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage
>>> 33.0
>>> (TID 283, 172.16.97.103): UnknownReason
>>>
>>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Recovery-for-Spark-Streaming-Kafka-Direct-in-case-of-issues-with-Kafka-tp25524.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
Right, you can't expect a completely cold first query to execute faster
than the data can be retrieved from the underlying datastore.  After that,
lowest latency query performance is largely a matter of caching -- for
which Spark provides at least partial solutions.
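
A small sketch of that caching for the rollup scenario earlier in the thread, assuming
df is a DataFrame already loaded once from the slow source and "products" is a
placeholder name:

df.registerTempTable("products")
sqlContext.cacheTable("products")   // materialized in memory on first use

// The first query pays the load cost; later ones should hit the in-memory columnar cache.
sqlContext.sql("SELECT Product, SUM(cost) AS total_cost FROM products GROUP BY Product").show()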

On Tue, Dec 1, 2015 at 4:27 PM, Michal Klos  wrote:

> You should consider presto for this use case. If you want fast "first
> query" times it is a better fit.
>
> I think sparksql will catch up at some point but if you are not doing
> multiple queries against data cached in RDDs and need low latency it may
> not be a good fit.
>
> M
>
> On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi  wrote:
>
> Ok, so latency problem is being generated because I'm using SQL as source?
> how about csv, hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra 
> wrote:
>
>> It is not designed for interactive queries.
>>
>>
>> You might want to ask the designers of Spark, Spark SQL, and particularly
>> some things built on top of Spark (such as BlinkDB) about their intent with
>> regard to interactive queries.  Interactive queries are not the only
>> designed use of Spark, but it is going too far to claim that Spark is not
>> designed at all to handle interactive queries.
>>
>> That being said, I think that you are correct to question the wisdom of
>> expecting lowest-latency query response from Spark using SQL (sic,
>> presumably a RDBMS is intended) as the datastore.
>>
>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke  wrote:
>>
>>> Hmm it will never be faster than SQL if you use SQL as an underlying
>>> storage. Spark is (currently) an in-memory batch engine for iterative
>>> machine learning workloads. It is not designed for interactive queries.
>>> Currently hive is going into the direction of interactive queries.
>>> Alternatives are Hbase on Phoenix or Impala.
>>>
>>> On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
>>>
>>> Yes,
>>> The use case would be,
>>> Have spark in a service (I didnt invertigate this yet), through api
>>> calls of this service we perform some aggregations over data in SQL, We are
>>> already doing this with an internal development
>>>
>>> Nothing complicated, for instance, a table with Product, Product Family,
>>> cost, price, etc. Columns like Dimension and Measures,
>>>
>>> I want to Spark for query that table to perform a kind of rollup, with
>>> cost as Measure and Prodcut, Product Family as Dimension
>>>
>>> Only 3 columns, it takes like 20s to perform that query and the
>>> aggregation, the  query directly to the database with a grouping at the
>>> columns takes like 1s
>>>
>>> regards
>>>
>>>
>>>
>>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke 
>>> wrote:
>>>
 can you elaborate more on the use case?

 > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
 >
 > Hi,
 >
 > I'd like to use spark to perform some transformations over data
 stored inSQL, but I need low Latency, I'm doing some test and I run into
 spark context creation and data query over SQL takes too long time.
 >
 > Any idea for speed up the process?
 >
 > regards.
 >
 > --
 > Ing. Ivaldi Andres

>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>
>


Re: Low Latency SQL query

2015-12-01 Thread Xiao Li
http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext

Try to read this article. It might help you understand your problem.

Thanks,

Xiao Li

2015-12-01 16:36 GMT-08:00 Mark Hamstra :

> I'd ask another question first: If your SQL query can be executed in a
> performant fashion against a conventional (RDBMS?) database, why are you
> trying to use Spark?  How you answer that question will be the key to
> deciding among the engineering design tradeoffs to effectively use Spark or
> some other solution.
>
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi  wrote:
>
>> Ok, so latency problem is being generated because I'm using SQL as
>> source? how about csv, hive, or another source?
>>
>> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra 
>> wrote:
>>
>>> It is not designed for interactive queries.
>>>
>>>
>>> You might want to ask the designers of Spark, Spark SQL, and
>>> particularly some things built on top of Spark (such as BlinkDB) about
>>> their intent with regard to interactive queries.  Interactive queries are
>>> not the only designed use of Spark, but it is going too far to claim that
>>> Spark is not designed at all to handle interactive queries.
>>>
>>> That being said, I think that you are correct to question the wisdom of
>>> expecting lowest-latency query response from Spark using SQL (sic,
>>> presumably a RDBMS is intended) as the datastore.
>>>
>>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke 
>>> wrote:
>>>
 Hmm it will never be faster than SQL if you use SQL as an underlying
 storage. Spark is (currently) an in-memory batch engine for iterative
 machine learning workloads. It is not designed for interactive queries.
 Currently hive is going into the direction of interactive queries.
 Alternatives are Hbase on Phoenix or Impala.

 On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:

 Yes,
 The use case would be,
 Have spark in a service (I didnt invertigate this yet), through api
 calls of this service we perform some aggregations over data in SQL, We are
 already doing this with an internal development

 Nothing complicated, for instance, a table with Product, Product
 Family, cost, price, etc. Columns like Dimension and Measures,

 I want to Spark for query that table to perform a kind of rollup, with
 cost as Measure and Prodcut, Product Family as Dimension

 Only 3 columns, it takes like 20s to perform that query and the
 aggregation, the  query directly to the database with a grouping at the
 columns takes like 1s

 regards



 On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke 
 wrote:

> can you elaborate more on the use case?
>
> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
> >
> > Hi,
> >
> > I'd like to use spark to perform some transformations over data
> stored inSQL, but I need low Latency, I'm doing some test and I run into
> spark context creation and data query over SQL takes too long time.
> >
> > Any idea for speed up the process?
> >
> > regards.
> >
> > --
> > Ing. Ivaldi Andres
>



 --
 Ing. Ivaldi Andres


>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
OK, so the latency problem is being generated because I'm using SQL as the
source? What about CSV, Hive, or another source?

On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra 
wrote:

> It is not designed for interactive queries.
>
>
> You might want to ask the designers of Spark, Spark SQL, and particularly
> some things built on top of Spark (such as BlinkDB) about their intent with
> regard to interactive queries.  Interactive queries are not the only
> designed use of Spark, but it is going too far to claim that Spark is not
> designed at all to handle interactive queries.
>
> That being said, I think that you are correct to question the wisdom of
> expecting lowest-latency query response from Spark using SQL (sic,
> presumably a RDBMS is intended) as the datastore.
>
> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke  wrote:
>
>> Hmm it will never be faster than SQL if you use SQL as an underlying
>> storage. Spark is (currently) an in-memory batch engine for iterative
>> machine learning workloads. It is not designed for interactive queries.
>> Currently hive is going into the direction of interactive queries.
>> Alternatives are Hbase on Phoenix or Impala.
>>
>> On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
>>
>> Yes,
>> The use case would be,
>> Have spark in a service (I didnt invertigate this yet), through api calls
>> of this service we perform some aggregations over data in SQL, We are
>> already doing this with an internal development
>>
>> Nothing complicated, for instance, a table with Product, Product Family,
>> cost, price, etc. Columns like Dimension and Measures,
>>
>> I want to Spark for query that table to perform a kind of rollup, with
>> cost as Measure and Prodcut, Product Family as Dimension
>>
>> Only 3 columns, it takes like 20s to perform that query and the
>> aggregation, the  query directly to the database with a grouping at the
>> columns takes like 1s
>>
>> regards
>>
>>
>>
>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke  wrote:
>>
>>> can you elaborate more on the use case?
>>>
>>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'd like to use spark to perform some transformations over data stored
>>> inSQL, but I need low Latency, I'm doing some test and I run into spark
>>> context creation and data query over SQL takes too long time.
>>> >
>>> > Any idea for speed up the process?
>>> >
>>> > regards.
>>> >
>>> > --
>>> > Ing. Ivaldi Andres
>>>
>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>>
>


-- 
Ing. Ivaldi Andres


Re: Low Latency SQL query

2015-12-01 Thread Michal Klos
You should consider Presto for this use case. If you want fast "first query"
times, it is a better fit.

I think Spark SQL will catch up at some point, but if you are not doing multiple
queries against data cached in RDDs and need low latency, it may not be a good
fit.

M

> On Dec 1, 2015, at 7:23 PM, Andrés Ivaldi  wrote:
> 
> Ok, so latency problem is being generated because I'm using SQL as source? 
> how about csv, hive, or another source?
> 
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra  wrote:
>>> It is not designed for interactive queries.
>> 
>> You might want to ask the designers of Spark, Spark SQL, and particularly 
>> some things built on top of Spark (such as BlinkDB) about their intent with 
>> regard to interactive queries.  Interactive queries are not the only 
>> designed use of Spark, but it is going too far to claim that Spark is not 
>> designed at all to handle interactive queries.
>> 
>> That being said, I think that you are correct to question the wisdom of 
>> expecting lowest-latency query response from Spark using SQL (sic, 
>> presumably a RDBMS is intended) as the datastore.
>> 
>>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke  wrote:
>>> Hmm it will never be faster than SQL if you use SQL as an underlying 
>>> storage. Spark is (currently) an in-memory batch engine for iterative 
>>> machine learning workloads. It is not designed for interactive queries. 
>>> Currently hive is going into the direction of interactive queries. 
>>> Alternatives are Hbase on Phoenix or Impala.
>>> 
 On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
 
 Yes, 
 The use case would be,
 Have spark in a service (I didnt invertigate this yet), through api calls 
 of this service we perform some aggregations over data in SQL, We are 
 already doing this with an internal development
 
 Nothing complicated, for instance, a table with Product, Product Family, 
 cost, price, etc. Columns like Dimension and Measures,
 
 I want to Spark for query that table to perform a kind of rollup, with 
 cost as Measure and Prodcut, Product Family as Dimension
 
 Only 3 columns, it takes like 20s to perform that query and the 
 aggregation, the  query directly to the database with a grouping at the 
 columns takes like 1s 
 
 regards
 
 
 
> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke  wrote:
> can you elaborate more on the use case?
> 
> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
> >
> > Hi,
> >
> > I'd like to use spark to perform some transformations over data stored 
> > inSQL, but I need low Latency, I'm doing some test and I run into spark 
> > context creation and data query over SQL takes too long time.
> >
> > Any idea for speed up the process?
> >
> > regards.
> >
> > --
> > Ing. Ivaldi Andres
 
 
 
 -- 
 Ing. Ivaldi Andres
> 
> 
> 
> -- 
> Ing. Ivaldi Andres


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Ted Yu
For #1, looks like the config is used in test suites:

.set("spark.driver.allowMultipleContexts", "true")
./sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala
.set("spark.driver.allowMultipleContexts", "true")
./sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeCoordinatorSuite.scala
  .set("spark.driver.allowMultipleContexts", "false")
val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
"false")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.driver.allowMultipleContexts", "true"))
./core/src/test/scala/org/apache/spark/SparkContextSuite.scala

FYI

On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu  wrote:

> Hi,
>
> I have a doubt regarding yarn-cluster mode and spark.driver.
> allowMultipleContexts for below usercases.
>
> I have a long running backend server where I will create a short-lived
> Spark job in response to each user request, base on the fact that by
> default multiple Spark Context cannot be created in the same JVM, looks
> like I have 2 choices
>
> 1) enable spark.driver.allowMultipleContexts
>
> 2) run my jobs in yarn-cluster mode instead yarn-client
>
> For 1) I cannot find any official document, so looks like it's not
> encouraged, isn't it?
> For 2), I want to make sure yarn-cluster will NOT hit such
> limitation(single SparkContext per VM), apparently I have to something in
> driver side to push the result set back to my application.
>
> Thanks
>
> --
> --Anfernee
>


Re: Send JsonDocument to Couchbase

2015-12-01 Thread Eyal Sharon
Anyone? I know there isn't much experience yet with the Couchbase connector.

On Tue, Dec 1, 2015 at 4:12 PM, Eyal Sharon  wrote:

> Hi ,
>
> I am still having problems with Couchbase connector . Consider the
> following code fragment which aims to create a Json document to send to
> Couch
>
>
>
>
>
>
> val muCache = {
>   val values = JsonArray.from(mu.toArray.map(_.toString))
>   val content = JsonObject.create().put("feature_mean", values).put("features", mu.size)
>   JsonDocument.create("mu", content)
> }
>
> *mu is a vector of doubles
>
> This fails because JsonArray.from is a Java method which doesn't accept a
> Scala string array. I got this error:
>
> Unsupported type for JsonArray: class [Ljava.lang.String;
>
>
> 1 - What could be a solution for that?
>
> 2 - For cleaner usage, I thought of using dedicated Scala JSON
> libraries, but none of them supports JsonDocuments.
>
> I followed the implementation from Couchbase's Spark connector documentation:
> http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html
>
>
> Thanks !
>
>
>
>
>
>
>
>
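
A sketch of one way around the JsonArray error quoted above: the Scala Array arrives at
JsonArray.from as a single (unsupported) String[] element, so convert it to a
java.util.List first. This assumes the JsonArray.from(List) overload of the Couchbase
Java SDK 2.x used by the Spark connector:

import scala.collection.JavaConverters._
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.{JsonArray, JsonObject}

val muCache = {
  val values = JsonArray.from(mu.toArray.map(_.toString).toList.asJava)  // java.util.List[String], not a Scala Array
  val content = JsonObject.create().put("feature_mean", values).put("features", mu.size)
  JsonDocument.create("mu", content)
}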



Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Following is the Option 2 that I was talking about:

2.Catch that exception and somehow force things to "reset" for that
partition And how would it handle the offsets already calculated in the
backlog (if there is one)?

On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy 
wrote:

> Hi Cody,
>
> How to look at Option 2(see the following)? Which portion of the code in
> Spark Kafka Direct to look at to handle this issue specific to our
> requirements.
>
>
> 2.Catch that exception and somehow force things to "reset" for that
> partition And how would it handle the offsets already calculated in the
> backlog (if there is one)?
>
> On Tue, Dec 1, 2015 at 6:51 AM, Cody Koeninger  wrote:
>
>> If you're consistently getting offset out of range exceptions, it's
>> probably because messages are getting deleted before you've processed them.
>>
>> The only real way to deal with this is give kafka more retention, consume
>> faster, or both.
>>
>> If you're just looking for a quick "fix" for an infrequent issue, option
>> 4 is probably easiest.  I wouldn't do that automatically / silently,
>> because you're losing data.
>>
>> On Mon, Nov 30, 2015 at 6:22 PM, SRK  wrote:
>>
>>> Hi,
>>>
>>> So, our Streaming Job fails with the following errors. If you see the
>>> errors
>>> below, they are all related to Kafka losing offsets and
>>> OffsetOutOfRangeException.
>>>
>>> What are the options we have other than fixing Kafka? We would like to do
>>> something like the following. How can we achieve 1 and 2 with Spark Kafka
>>> Direct?
>>>
>>> 1.Need to see a way to skip some offsets if they are not available after
>>> the
>>> max retries are reached..in that case there might be data loss.
>>>
>>> 2.Catch that exception and somehow force things to "reset" for that
>>> partition And how would it handle the offsets already calculated in the
>>> backlog (if there is one)?
>>>
>>> 3.Track the offsets separately, restart the job by providing the offsets.
>>>
>>> 4.Or a straightforward approach would be to monitor the log for this
>>> error,
>>> and if it occurs more than X times, kill the job, remove the checkpoint
>>> directory, and restart.
>>>
>>> ERROR DirectKafkaInputDStream: ArrayBuffer(kafka.common.UnknownException,
>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>> Set([test_stream,5]))
>>>
>>>
>>>
>>> java.lang.ClassNotFoundException:
>>> kafka.common.NotLeaderForPartitionException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>>
>>>
>>> java.util.concurrent.RejectedExecutionException: Task
>>>
>>> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
>>> rejected from java.util.concurrent.ThreadPoolExecutor@543258e0
>>> [Terminated,
>>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
>>> 12112]
>>>
>>>
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 10
>>> in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in
>>> stage
>>> 52.0 (TID 255, 172.16.97.97): UnknownReason
>>>
>>> Exception in thread "streaming-job-executor-0" java.lang.Error:
>>> java.lang.InterruptedException
>>>
>>> Caused by: java.lang.InterruptedException
>>>
>>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>>
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 7 in
>>> stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage
>>> 33.0
>>> (TID 283, 172.16.97.103): UnknownReason
>>>
>>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>>>
>>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Recovery-for-Spark-Streaming-Kafka-Direct-in-case-of-issues-with-Kafka-tp25524.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Low Latency SQL query

2015-12-01 Thread Mark Hamstra
I'd ask another question first: If your SQL query can be executed in a
performant fashion against a conventional (RDBMS?) database, why are you
trying to use Spark?  How you answer that question will be the key to
deciding among the engineering design tradeoffs to effectively use Spark or
some other solution.

On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi  wrote:

> Ok, so latency problem is being generated because I'm using SQL as source?
> how about csv, hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra 
> wrote:
>
>> It is not designed for interactive queries.
>>
>>
>> You might want to ask the designers of Spark, Spark SQL, and particularly
>> some things built on top of Spark (such as BlinkDB) about their intent with
>> regard to interactive queries.  Interactive queries are not the only
>> designed use of Spark, but it is going too far to claim that Spark is not
>> designed at all to handle interactive queries.
>>
>> That being said, I think that you are correct to question the wisdom of
>> expecting lowest-latency query response from Spark using SQL (sic,
>> presumably a RDBMS is intended) as the datastore.
>>
>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke  wrote:
>>
>>> Hmm it will never be faster than SQL if you use SQL as an underlying
>>> storage. Spark is (currently) an in-memory batch engine for iterative
>>> machine learning workloads. It is not designed for interactive queries.
>>> Currently hive is going into the direction of interactive queries.
>>> Alternatives are Hbase on Phoenix or Impala.
>>>
>>> On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
>>>
>>> Yes,
>>> The use case would be,
>>> Have spark in a service (I didnt invertigate this yet), through api
>>> calls of this service we perform some aggregations over data in SQL, We are
>>> already doing this with an internal development
>>>
>>> Nothing complicated, for instance, a table with Product, Product Family,
>>> cost, price, etc. Columns like Dimension and Measures,
>>>
>>> I want to Spark for query that table to perform a kind of rollup, with
>>> cost as Measure and Prodcut, Product Family as Dimension
>>>
>>> Only 3 columns, it takes like 20s to perform that query and the
>>> aggregation, the  query directly to the database with a grouping at the
>>> columns takes like 1s
>>>
>>> regards
>>>
>>>
>>>
>>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke 
>>> wrote:
>>>
 can you elaborate more on the use case?

 > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
 >
 > Hi,
 >
 > I'd like to use spark to perform some transformations over data
 stored inSQL, but I need low Latency, I'm doing some test and I run into
 spark context creation and data query over SQL takes too long time.
 >
 > Any idea for speed up the process?
 >
 > regards.
 >
 > --
 > Ing. Ivaldi Andres

>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


how to use spark.mesos.constraints

2015-12-01 Thread rarediel
I am trying to add Mesos constraints to the spark-submit command in my
Marathon file, and I am setting spark.mesos.coarse=true.

Here is an example of a constraint I am trying to set.

 --conf spark.mesos.constraint=cpus:2 

I want to use the constraints to control how many executors are created,
so I can control the total memory of my Spark job.

I've tried many variations of resource constraints, but no matter which
resource or what number, range, etc. I use, I always get the error "Initial
job has not accepted any resources; check your cluster UI...". My cluster
has the resources available. Are there any examples I can look at where
people use resource constraints?
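
For what it's worth, a hedged sketch of the relevant settings: the key from the subject
line is spark.mesos.constraints (plural), it matches Mesos agent attributes rather than
resources such as cpus, and in coarse-grained mode the executor footprint is usually
bounded with spark.cores.max and spark.executor.memory (the attribute value below is a
placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.coarse", "true")
  .set("spark.mesos.constraints", "rack:r1")   // attribute-based filter on Mesos agents; placeholder value
  .set("spark.cores.max", "8")                 // caps total cores, indirectly capping executor count
  .set("spark.executor.memory", "4g")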



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-use-spark-mesos-constraints-tp25541.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Low Latency SQL query

2015-12-01 Thread Andrés Ivaldi
Mark, we have an application that uses data from different kinds of sources,
and we built an engine able to handle that, but it can't scale to big data (we
could make it scale, but it would be too time-expensive) and it has no
machine-learning module, etc. We came across Spark and it looks like it has all
we need; actually it does, but our current latency is very low, and when we did
some testing Spark took far longer for the same kind of results, always against
the RDBMS, which is our primary source.

So we want to expand our sources to CSV, web services, big data, etc. We can
either extend our engine or use something like Spark, which gives us the power
of clustering, access to different kinds of sources, streaming, machine
learning, easy extensibility, and so on.

On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra 
wrote:

> I'd ask another question first: If your SQL query can be executed in a
> performant fashion against a conventional (RDBMS?) database, why are you
> trying to use Spark?  How you answer that question will be the key to
> deciding among the engineering design tradeoffs to effectively use Spark or
> some other solution.
>
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi  wrote:
>
>> Ok, so latency problem is being generated because I'm using SQL as
>> source? how about csv, hive, or another source?
>>
>> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra 
>> wrote:
>>
>>> It is not designed for interactive queries.
>>>
>>>
>>> You might want to ask the designers of Spark, Spark SQL, and
>>> particularly some things built on top of Spark (such as BlinkDB) about
>>> their intent with regard to interactive queries.  Interactive queries are
>>> not the only designed use of Spark, but it is going too far to claim that
>>> Spark is not designed at all to handle interactive queries.
>>>
>>> That being said, I think that you are correct to question the wisdom of
>>> expecting lowest-latency query response from Spark using SQL (sic,
>>> presumably a RDBMS is intended) as the datastore.
>>>
>>> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke 
>>> wrote:
>>>
 Hmm it will never be faster than SQL if you use SQL as an underlying
 storage. Spark is (currently) an in-memory batch engine for iterative
 machine learning workloads. It is not designed for interactive queries.
 Currently hive is going into the direction of interactive queries.
 Alternatives are Hbase on Phoenix or Impala.

 On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:

 Yes,
 The use case would be,
 Have spark in a service (I didnt invertigate this yet), through api
 calls of this service we perform some aggregations over data in SQL, We are
 already doing this with an internal development

 Nothing complicated, for instance, a table with Product, Product
 Family, cost, price, etc. Columns like Dimension and Measures,

 I want to Spark for query that table to perform a kind of rollup, with
 cost as Measure and Prodcut, Product Family as Dimension

 Only 3 columns, it takes like 20s to perform that query and the
 aggregation, the  query directly to the database with a grouping at the
 columns takes like 1s

 regards



 On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke 
 wrote:

> can you elaborate more on the use case?
>
> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
> >
> > Hi,
> >
> > I'd like to use spark to perform some transformations over data
> stored inSQL, but I need low Latency, I'm doing some test and I run into
> spark context creation and data query over SQL takes too long time.
> >
> > Any idea for speed up the process?
> >
> > regards.
> >
> > --
> > Ing. Ivaldi Andres
>



 --
 Ing. Ivaldi Andres


>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Ted Yu
Looks like #2 is better choice.

On Tue, Dec 1, 2015 at 4:51 PM, Anfernee Xu  wrote:

> Thanks Ted, so 1) is off from the table, can I go with 2), yarn-cluster
> mode? As the driver is running as a Yarn container, it's should be OK for
> my usercase, isn't it?
>
> Anfernee
>
> On Tue, Dec 1, 2015 at 4:48 PM, Ted Yu  wrote:
>
>> For #1, looks like the config is used in test suites:
>>
>> .set("spark.driver.allowMultipleContexts", "true")
>> ./sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala
>> .set("spark.driver.allowMultipleContexts", "true")
>>
>> ./sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeCoordinatorSuite.scala
>>   .set("spark.driver.allowMultipleContexts", "false")
>> val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
>> "false")
>> .set("spark.driver.allowMultipleContexts", "true")
>> .set("spark.driver.allowMultipleContexts", "true"))
>> ./core/src/test/scala/org/apache/spark/SparkContextSuite.scala
>>
>> FYI
>>
>> On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a doubt regarding yarn-cluster mode and spark.driver.
>>> allowMultipleContexts for below usercases.
>>>
>>> I have a long running backend server where I will create a short-lived
>>> Spark job in response to each user request, base on the fact that by
>>> default multiple Spark Context cannot be created in the same JVM, looks
>>> like I have 2 choices
>>>
>>> 1) enable spark.driver.allowMultipleContexts
>>>
>>> 2) run my jobs in yarn-cluster mode instead yarn-client
>>>
>>> For 1) I cannot find any official document, so looks like it's not
>>> encouraged, isn't it?
>>> For 2), I want to make sure yarn-cluster will NOT hit such
>>> limitation(single SparkContext per VM), apparently I have to something in
>>> driver side to push the result set back to my application.
>>>
>>> Thanks
>>>
>>> --
>>> --Anfernee
>>>
>>
>>
>
>
> --
> --Anfernee
>


Re: New to Spark

2015-12-01 Thread fightf...@163.com
Hi there,
Which Spark version are you using? You made the Hive metastore available to
Spark, which means you can run SQL queries over the existing Hive tables, right?
Or are you just using a local Hive metastore embedded on the Spark SQL side?
I think you need to provide more info on your Spark SQL and Hive configuration;
that would help locate the root cause of the problem.

Best,
Sun.



fightf...@163.com
 
From: Ashok Kumar
Date: 2015-12-01 18:54
To: user@spark.apache.org
Subject: New to Spark

Hi,

I am new to Spark.

I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables.

I have successfully made Hive metastore to be used by Spark.

In spark-sql I can see the DDL for Hive tables. However, when I do select 
count(1) from HIVE_TABLE it always returns zero rows.

If I create a table in spark as create table SPARK_TABLE as select * from 
HIVE_TABLE, the table schema is created but no data.

I can then use Hive to do INSERT SELECT from HIVE_TABLE into SPARK_TABLE. That 
works.

I can then use spark-sql to query the table.

My questions:

Is it correct that spark-sql only sees data in Spark-created tables but not 
any data in Hive tables?
How can I make Spark read data from existing Hive tables?


Thanks




Re: New to Spark

2015-12-01 Thread Ted Yu
Have you tried the following command ?
REFRESH TABLE 
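
e.g. from the Scala shell (the table name below is just a placeholder for your table):

sqlContext.sql("REFRESH TABLE my_hive_table")
sqlContext.sql("SELECT COUNT(1) FROM my_hive_table").show()

or programmatically on a HiveContext: hiveContext.refreshTable("my_hive_table")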

Cheers



On Tue, Dec 1, 2015 at 1:54 AM, Ashok Kumar 
wrote:

> Hi,
>
> I am new to Spark.
>
> I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables.
>
> I have successfully made Hive metastore to be used by Spark.
>
> In spark-sql I can see the DDL for Hive tables. However, when I do select
> count(1) from HIVE_TABLE it always returns zero rows.
>
> If I create a table in spark as create table SPARK_TABLE as select * from
> HIVE_TABLE, the table schema is created but no data.
>
> I can then use Hive to do INSET SELECT from HIVE_TABLE into SPARK_TABLE.
> That works.
>
> I can then use spark-sql to query the table.
>
> My questions:
>
>
>1. Is this correct that spark-sql only sees data in spark created
>tables but not any data in Hive tables?
>2. How can I make Spark read data from existing Hive tables.
>
>
>
> Thanks
>


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Josh Rosen
Yep, you shouldn't enable *spark.driver.allowMultipleContexts* since it has
the potential to cause extremely difficult-to-debug task failures; it was
originally introduced as an escape-hatch to allow users whose workloads
happened to work "by accident" to continue using multiple active contexts,
but I would not write any new code which uses it.

On Tue, Dec 1, 2015 at 5:45 PM, Ted Yu  wrote:

> Looks like #2 is better choice.
>
> On Tue, Dec 1, 2015 at 4:51 PM, Anfernee Xu  wrote:
>
>> Thanks Ted, so 1) is off from the table, can I go with 2), yarn-cluster
>> mode? As the driver is running as a Yarn container, it's should be OK for
>> my usercase, isn't it?
>>
>> Anfernee
>>
>> On Tue, Dec 1, 2015 at 4:48 PM, Ted Yu  wrote:
>>
>>> For #1, looks like the config is used in test suites:
>>>
>>> .set("spark.driver.allowMultipleContexts", "true")
>>>
>>> ./sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala
>>> .set("spark.driver.allowMultipleContexts", "true")
>>>
>>> ./sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeCoordinatorSuite.scala
>>>   .set("spark.driver.allowMultipleContexts", "false")
>>> val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
>>> "false")
>>> .set("spark.driver.allowMultipleContexts", "true")
>>> .set("spark.driver.allowMultipleContexts", "true"))
>>> ./core/src/test/scala/org/apache/spark/SparkContextSuite.scala
>>>
>>> FYI
>>>
>>> On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu 
>>> wrote:
>>>
 Hi,

 I have a doubt regarding yarn-cluster mode and spark.driver.
 allowMultipleContexts for below usercases.

 I have a long running backend server where I will create a short-lived
 Spark job in response to each user request, base on the fact that by
 default multiple Spark Context cannot be created in the same JVM, looks
 like I have 2 choices

 1) enable spark.driver.allowMultipleContexts

 2) run my jobs in yarn-cluster mode instead yarn-client

 For 1) I cannot find any official document, so looks like it's not
 encouraged, isn't it?
 For 2), I want to make sure yarn-cluster will NOT hit such
 limitation(single SparkContext per VM), apparently I have to something in
 driver side to push the result set back to my application.

 Thanks

 --
 --Anfernee

>>>
>>>
>>
>>
>> --
>> --Anfernee
>>
>
>


Re: Can Spark Execute Hive Update/Delete operations

2015-12-01 Thread Ted Yu
Can you tell us the version of Spark and Hive you use?

Thanks

On Tue, Dec 1, 2015 at 7:08 PM, 张炜  wrote:

> Dear all,
> We have a requirement that needs to update delete records in hive. These
> operations are available in hive now.
>
> But when using hiveContext in Spark, it always pops up an "not supported"
> error.
> Is there anyway to support update/delete operations using spark?
>
> Regards,
> Sai
>


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Thanks Ted, so 1) is off the table; can I go with 2), yarn-cluster
mode? As the driver is running as a Yarn container, it should be OK for
my use case, shouldn't it?

Anfernee

On Tue, Dec 1, 2015 at 4:48 PM, Ted Yu  wrote:

> For #1, looks like the config is used in test suites:
>
> .set("spark.driver.allowMultipleContexts", "true")
> ./sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala
> .set("spark.driver.allowMultipleContexts", "true")
>
> ./sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeCoordinatorSuite.scala
>   .set("spark.driver.allowMultipleContexts", "false")
> val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
> "false")
> .set("spark.driver.allowMultipleContexts", "true")
> .set("spark.driver.allowMultipleContexts", "true"))
> ./core/src/test/scala/org/apache/spark/SparkContextSuite.scala
>
> FYI
>
> On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu  wrote:
>
>> Hi,
>>
>> I have a doubt regarding yarn-cluster mode and spark.driver.
>> allowMultipleContexts for below usercases.
>>
>> I have a long running backend server where I will create a short-lived
>> Spark job in response to each user request, base on the fact that by
>> default multiple Spark Context cannot be created in the same JVM, looks
>> like I have 2 choices
>>
>> 1) enable spark.driver.allowMultipleContexts
>>
>> 2) run my jobs in yarn-cluster mode instead yarn-client
>>
>> For 1) I cannot find any official document, so looks like it's not
>> encouraged, isn't it?
>> For 2), I want to make sure yarn-cluster will NOT hit such
>> limitation(single SparkContext per VM), apparently I have to something in
>> driver side to push the result set back to my application.
>>
>> Thanks
>>
>> --
>> --Anfernee
>>
>
>


-- 
--Anfernee


Re: Low Latency SQL query

2015-12-01 Thread Fengdong Yu
It depends on many situations:

1) what’s your data format?  csv(text) or ORC/parquet?
2) Did you have Data warehouse to summary/cluster  your data?


If your data is text or you query the raw data, it will be slow; Spark 
cannot do much to optimize your job.
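
If the working set fits in memory, one more thing worth trying is to load the
table from the RDBMS once, cache it, and serve the repeated aggregations from
the cache instead of going back through JDBC on every request. A rough sketch
(connection string, table and column names are placeholders, not from this thread):

val products = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")   // placeholder connection
  .option("dbtable", "product_costs")                // placeholder table
  .load()
  .cache()                                           // materialized on first use

products.registerTempTable("product_costs")

// later aggregations hit the cached data, not the database
sqlContext.sql(
  "SELECT product, product_family, SUM(cost) AS total_cost " +
  "FROM product_costs GROUP BY product, product_family").show()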




> On Dec 2, 2015, at 9:21 AM, Andrés Ivaldi  wrote:
> 
> Mark, We have an application that use data from different kind of source, and 
> we build a engine able to handle that, but cant scale with big data(we could 
> but is to time expensive), and doesn't have Machine learning module, etc, we 
> came across with Spark and it's looks like it have all we need, actually it 
> does, but our latency is very low right now, and when we do some testing it 
> took too long time for the same kind of results, always against RDBM which is 
> our primary source. 
> 
> So, we want to expand our sources, to cvs, web service, big data, etc, we can 
> extend our engine or use something like Spark, which give as power of 
> clustering, different kind of source access, streaming, machine learning, 
> easy extensibility and so on. 
> 
> On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra  > wrote:
> I'd ask another question first: If your SQL query can be executed in a 
> performant fashion against a conventional (RDBMS?) database, why are you 
> trying to use Spark?  How you answer that question will be the key to 
> deciding among the engineering design tradeoffs to effectively use Spark or 
> some other solution.
> 
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi  > wrote:
> Ok, so latency problem is being generated because I'm using SQL as source? 
> how about csv, hive, or another source?
> 
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra  > wrote:
> It is not designed for interactive queries.
> 
> You might want to ask the designers of Spark, Spark SQL, and particularly 
> some things built on top of Spark (such as BlinkDB) about their intent with 
> regard to interactive queries.  Interactive queries are not the only designed 
> use of Spark, but it is going too far to claim that Spark is not designed at 
> all to handle interactive queries.
> 
> That being said, I think that you are correct to question the wisdom of 
> expecting lowest-latency query response from Spark using SQL (sic, presumably 
> a RDBMS is intended) as the datastore.
> 
> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke  > wrote:
> Hmm it will never be faster than SQL if you use SQL as an underlying storage. 
> Spark is (currently) an in-memory batch engine for iterative machine learning 
> workloads. It is not designed for interactive queries. 
> Currently hive is going into the direction of interactive queries. 
> Alternatives are Hbase on Phoenix or Impala.
> 
> On 01 Dec 2015, at 21:58, Andrés Ivaldi  > wrote:
> 
>> Yes, 
>> The use case would be,
>> Have spark in a service (I didnt invertigate this yet), through api calls of 
>> this service we perform some aggregations over data in SQL, We are already 
>> doing this with an internal development
>> 
>> Nothing complicated, for instance, a table with Product, Product Family, 
>> cost, price, etc. Columns like Dimension and Measures,
>> 
>> I want to Spark for query that table to perform a kind of rollup, with cost 
>> as Measure and Prodcut, Product Family as Dimension
>> 
>> Only 3 columns, it takes like 20s to perform that query and the aggregation, 
>> the  query directly to the database with a grouping at the columns takes 
>> like 1s 
>> 
>> regards
>> 
>> 
>> 
>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke > > wrote:
>> can you elaborate more on the use case?
>> 
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi > > > wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use spark to perform some transformations over data stored 
>> > inSQL, but I need low Latency, I'm doing some test and I run into spark 
>> > context creation and data query over SQL takes too long time.
>> >
>> > Any idea for speed up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>> 
>> 
>> 
>> -- 
>> Ing. Ivaldi Andres
> 
> 
> 
> 
> -- 
> Ing. Ivaldi Andres
> 
> 
> 
> 
> -- 
> Ing. Ivaldi Andres



Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
From the dependency tree, akka 2.4.0 was in effect.

Maybe check the classpath of the master to see if there is an older version of
akka.

Cheers


Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-01 Thread Darin McBeath


The problem isn't really with DTD validation (by default validation is 
disabled).  The underlying problem is that the DTD can't be found (which is 
indicated in your stack trace below).  The underlying parser will try and 
retrieve the DTD (regardless of  validation) because things such as entities 
could be expressed in the DTD.

I will explore providing access to some of the underlying 'processor' 
configurations.  For example, you could provide your own EntityResolver class 
that could either completely ignore the Doctype declaration (return a 'dummy' 
DTD that is completely empty) or you could have it find 'local' versions (on 
the workers or in S3 and then cache them locally for performance).  
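
For reference, an EntityResolver of the kind described is only a few lines of
standard SAX code (a generic sketch, not something spark-xml-utils exposes yet):

import org.xml.sax.{EntityResolver, InputSource}
import java.io.StringReader

// Resolves every external entity/DTD reference to an empty document, so the
// parser never tries to fetch the real DTD from disk or over the network.
class IgnoreDtdResolver extends EntityResolver {
  override def resolveEntity(publicId: String, systemId: String): InputSource =
    new InputSource(new StringReader(""))
}

A 'local copies' variant would instead return an InputSource that points at a
DTD cached on the worker (or pulled once from S3).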

I will post an update when the code has been adjusted.

Darin.

- Original Message -
From: Shivalik 
To: user@spark.apache.org
Sent: Tuesday, December 1, 2015 8:15 AM
Subject: Turning off DTD Validation using XML Utils package - Spark

Hi Team,

I've been using XML Utils library 
(http://spark-packages.org/package/elsevierlabs-os/spark-xml-utils) to parse
XML using XPath in a spark job. One problem I am facing is with the DTDs. My
XML file has a doctype tag included in it.

I want to turn off DTD validation using this library since I don't have
access to the DTD file. Has someone faced this problem before? Please help.

The exception I am getting it is as below:

stage 0.0 (TID 0, localhost):
com.elsevier.spark_xml_utils.xpath.XPathException: I/O error reported by XML
parser processing null: /filename.dtd (No such file or directory)

at
com.elsevier.spark_xml_utils.xpath.XPathProcessor.evaluate(XPathProcessor.java:301)

at
com.elsevier.spark_xml_utils.xpath.XPathProcessor.evaluateString(XPathProcessor.java:219)

at com.thomsonreuters.xmlutils.XMLParser.lambda$0(XMLParser.java:31)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-off-DTD-Validation-using-XML-Utils-package-Spark-tp25534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Jacek Laskowski
On Tue, Dec 1, 2015 at 2:32 PM, RodrigoB  wrote:

> I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.

Why? AFAIK Spark's leaving Akka's boat and joins Netty's.

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Boavida, Rodrigo
HI Jacek,

Yes, I was told that as well, but no one gave me release schedules, and I have 
the immediate need to have Spark applications communicating with Akka clusters 
based on the latest version. I'm aware there is an ongoing effort to change to the 
low-level Netty implementation, but AFAIK it's not available yet.

Any suggestions are very welcomed.

Tnks,
Rod

-Original Message-
From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: 01 December 2015 18:17
To: Boavida, Rodrigo 
Cc: user 
Subject: Re: Scala 2.11 and Akka 2.4.0

On Tue, Dec 1, 2015 at 2:32 PM, RodrigoB  wrote:

> I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.

Why? AFAIK Spark's leaving Akka's boat and joins Netty's.

Jacek
This email (including any attachments) is proprietary to Aspect Software, Inc. 
and may contain information that is confidential. If you have received this 
message in error, please do not read, copy or forward this message. Please 
notify the sender immediately, delete it from your system and destroy any 
copies. You may not further disclose or distribute this email or its 
attachments.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
You can do grid search if you set the evaluator to a
MulticlassClassificationEvaluator, which expects a prediction column, not a
rawPrediction column.  There's a JIRA for making
BinaryClassificationEvaluator accept prediction instead of rawPrediction.
Joseph

On Tue, Dec 1, 2015 at 5:10 AM, Benjamin Fradet 
wrote:

> Someone correct me if I'm wrong but no there isn't one that I am aware of.
>
> Unless someone is willing to explain how to obtain the raw prediction
> column with the GBTClassifier. In this case I'd be happy to work on a PR.
> On 1 Dec 2015 8:43 a.m., "Ndjido Ardo BAR"  wrote:
>
>> Hi Benjamin,
>>
>> Thanks, the documentation you sent is clear.
>> Is there any other way to perform a Grid Search with GBT?
>>
>>
>> Ndjido
>> On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet 
>> wrote:
>>
>>> Hi Ndjido,
>>>
>>> This is because GBTClassifier doesn't yet have a rawPredictionCol like
>>> the. RandomForestClassifier has.
>>> Cf:
>>> http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
>>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR"  wrote:
>>>
 Hi Joseph,

 Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting
 a "rawPredictionCol field does not exist exception" on Spark 1.5.2 for
 Gradient Boosting Trees classifier.


 Ardo
 On Tue, 1 Dec 2015 at 01:34, Joseph Bradley 
 wrote:

> It should work with 1.5+.
>
> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar 
> wrote:
>
>>
>> Hi folks,
>>
>> Does anyone know whether the Grid Search capability is enabled since
>> the issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>> column doesn't exist" when trying to perform a grid search with Spark 
>> 1.4.0.
>>
>> Cheers,
>> Ardo
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: General question on using StringIndexer in SparkML

2015-12-01 Thread Vishnu Viswanath
Hi Jeff,

I went through the link you provided and I could understand how the fit()
and transform() work.
I tried to use the pipeline in my code and I am getting an exception: Caused
by: org.apache.spark.SparkException: Unseen label:

The reason for this error, as per my understanding, is:
For the column on which I am doing StringIndexing, the test data has
values that were not present in the train data.
Since fit() is done only on the train data, the indexing fails.

Can you suggest what can be done in this situation?
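
(The only workaround I can think of so far is to fit the StringIndexer on the
union of the train and test values, so that every label is seen at fit time,
and then reuse that single model on both sides - a rough sketch with placeholder
data and column names:

import org.apache.spark.ml.feature.StringIndexer

val trainDF = sqlContext.createDataFrame(Seq((0, "a"), (1, "b"), (2, "a"))).toDF("id", "category")
val testDF  = sqlContext.createDataFrame(Seq((3, "a"), (4, "c"))).toDF("id", "category")  // "c" unseen in train

// fit on every label value that can appear, then reuse the one model everywhere
val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(trainDF.select("category").unionAll(testDF.select("category")))

indexerModel.transform(trainDF).show()
indexerModel.transform(testDF).show()

but I am not sure this is the intended way when the test data only arrives later.)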

Thanks,

On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

Thank you Jeff.
>
> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang  wrote:
>
>> StringIndexer is an estimator which would train a model to be used both
>> in training & prediction. So it is consistent between training & prediction.
>>
>> You may want to read this section of spark ml doc
>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>
>>
>>
>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>>> Thanks for the reply Yanbo.
>>>
>>> I understand that the model will be trained using the indexer map
>>> created during the training stage.
>>>
>>> But since I am getting a new set of data during prediction, and I have
>>> to do StringIndexing on the new data also,
>>> Right now I am using a new StringIndexer for this purpose, or is there
>>> any way that I can reuse the Indexer used for training stage.
>>>
>>> Note: I am having a pipeline with StringIndexer in it, and I am fitting
>>> my train data in it and building the model. Then later when i get the new
>>> data for prediction, I am using the same pipeline to fit the data again and
>>> do the prediction.
>>>
>>> Thanks and Regards,
>>> Vishnu Viswanath
>>>
>>>
>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang  wrote:
>>>
 Hi Vishnu,

 The string and indexer map is generated at model training step and
 used at model prediction step.
 It means that the string and indexer map will not changed when
 prediction. You will use the original trained model when you do
 prediction.

 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
 vishnu.viswanat...@gmail.com>:
 > Hi All,
 >
 > I have a general question on using StringIndexer.
 > StringIndexer gives an index to each label in the feature starting
 from 0 (
 > 0 for least frequent word).
 >
 > Suppose I am building a model, and I use StringIndexer for
 transforming on
 > of my column.
 > e.g., suppose A was most frequent word followed by B and C.
 >
 > So the StringIndexer will generate
 >
 > A  0.0
 > B  1.0
 > C  2.0
 >
 > After building the model, I am going to do some prediction using this
 model,
 > So I do the same transformation on my new data which I need to
 predict. And
 > suppose the new dataset has C as the most frequent word, followed by
 B and
 > A. So the StringIndexer will assign index as
 >
 > C 0.0
 > B 1.0
 > A 2.0
 >
 > These indexes are different from what we used for modeling. So won’t
 this
 > give me a wrong prediction if I use StringIndexer?
 >
 >

>>>
>>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> ​


Re: Spark on Mesos with Centos 6.6 NFS

2015-12-01 Thread Akhil Das
Can you try mounting the NFS directory at the same location on all machines
(say /mnt/nfs) and try again?

Thanks
Best Regards

On Thu, Nov 26, 2015 at 1:22 PM, leonidas  wrote:

> Hello,
> I have a setup with spark 1.5.1 on top of Mesos with one master and 4
> slaves. I am submitting a Spark job were its output (3 parquet folders that
> will represent 3 dataframes) should be written in an shared NFS folder. I
> keep getting an error though though:
>
> 15/11/25 10:07:22 WARN TaskSetManager: Lost task 4.0 in stage 12.0 (TID
> 711,
> remoteMachineHost): java.io.IOException: Mkdirs failed to create
>
> file:/some/shared/folder/my.parquet/_temporary/0/_temporary/attempt_201511251007_0012_m_04_0
> (exists=false,
>
> cwd=file:/project/mesos/work/slaves/8ef6b8ae-95fc-4963-b8c1-718edb988f3f-S7/frameworks/8ef6b8ae-95fc-4963-b8c1-718edb988f3f-0085/executors/0/runs/efd17cf8-30eb-4981-8d61-4dbb36a27dc7)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
> at
>
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:176)
> at
>
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:160)
> at
>
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
> at
>
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at
>
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
> at
>
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
> at
>
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> The remoteMachineHost that throws the error has write access to the
> specific
> folder.
> Any thoughts?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Mesos-with-Centos-6-6-NFS-tp25489.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Low Latency SQL query

2015-12-01 Thread ayan guha
You can try query pushdown by specifying the query itself when creating the RDD/DataFrame.
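i.e. pass the aggregation as a subquery in the dbtable option, so the grouping
runs inside the database and only the small result comes back to Spark. A rough
sketch (connection string, table and column names are placeholders):

val rollup = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")   // placeholder connection
  .option("dbtable",
    "(SELECT product, product_family, SUM(cost) AS total_cost " +
    "FROM product_costs GROUP BY product, product_family) agg")
  .load()

rollup.show()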
On 2 Dec 2015 12:32, "Fengdong Yu"  wrote:

> It depends on many situations:
>
> 1) what’s your data format?  csv(text) or ORC/parquet?
> 2) Did you have Data warehouse to summary/cluster  your data?
>
>
> if your data is text or you query for the raw data, It should be slow,
> Spark cannot do much to optimize your job.
>
>
>
>
> On Dec 2, 2015, at 9:21 AM, Andrés Ivaldi  wrote:
>
> Mark, We have an application that use data from different kind of source,
> and we build a engine able to handle that, but cant scale with big data(we
> could but is to time expensive), and doesn't have Machine learning module,
> etc, we came across with Spark and it's looks like it have all we need,
> actually it does, but our latency is very low right now, and when we do
> some testing it took too long time for the same kind of results, always
> against RDBM which is our primary source.
>
> So, we want to expand our sources, to cvs, web service, big data, etc, we
> can extend our engine or use something like Spark, which give as power of
> clustering, different kind of source access, streaming, machine learning,
> easy extensibility and so on.
>
> On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra 
> wrote:
>
>> I'd ask another question first: If your SQL query can be executed in a
>> performant fashion against a conventional (RDBMS?) database, why are you
>> trying to use Spark?  How you answer that question will be the key to
>> deciding among the engineering design tradeoffs to effectively use Spark or
>> some other solution.
>>
>> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi  wrote:
>>
>>> Ok, so latency problem is being generated because I'm using SQL as
>>> source? how about csv, hive, or another source?
>>>
>>> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra 
>>> wrote:
>>>
 It is not designed for interactive queries.


 You might want to ask the designers of Spark, Spark SQL, and
 particularly some things built on top of Spark (such as BlinkDB) about
 their intent with regard to interactive queries.  Interactive queries are
 not the only designed use of Spark, but it is going too far to claim that
 Spark is not designed at all to handle interactive queries.

 That being said, I think that you are correct to question the wisdom of
 expecting lowest-latency query response from Spark using SQL (sic,
 presumably a RDBMS is intended) as the datastore.

 On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke 
 wrote:

> Hmm it will never be faster than SQL if you use SQL as an underlying
> storage. Spark is (currently) an in-memory batch engine for iterative
> machine learning workloads. It is not designed for interactive queries.
> Currently hive is going into the direction of interactive queries.
> Alternatives are Hbase on Phoenix or Impala.
>
> On 01 Dec 2015, at 21:58, Andrés Ivaldi  wrote:
>
> Yes,
> The use case would be,
> Have spark in a service (I didnt invertigate this yet), through api
> calls of this service we perform some aggregations over data in SQL, We 
> are
> already doing this with an internal development
>
> Nothing complicated, for instance, a table with Product, Product
> Family, cost, price, etc. Columns like Dimension and Measures,
>
> I want to Spark for query that table to perform a kind of rollup, with
> cost as Measure and Prodcut, Product Family as Dimension
>
> Only 3 columns, it takes like 20s to perform that query and the
> aggregation, the  query directly to the database with a grouping at the
> columns takes like 1s
>
> regards
>
>
>
> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke 
> wrote:
>
>> can you elaborate more on the use case?
>>
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi  wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use spark to perform some transformations over data
>> stored inSQL, but I need low Latency, I'm doing some test and I run into
>> spark context creation and data query over SQL takes too long time.
>> >
>> > Any idea for speed up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>>
>
>
>
> --
> Ing. Ivaldi Andres
>
>

>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>
>
> --
> Ing. Ivaldi Andres
>
>
>


Graphx: How to print the group of connected components one by one

2015-12-01 Thread Zhang, Jingyu
Can anyone please let me know how to print all nodes in the connected
components one by one?

graph.connectedComponents()

e.g.

connected Component ID  Nodes ID

1  1,2,3

6   6,7,8,9
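
i.e. grouping the vertex ids by the component id that connectedComponents()
assigns to every vertex. A rough sketch of what I am after (self-contained,
with the edges for the example above hard-coded):

import org.apache.spark.graphx.Graph

val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (6L, 7L), (7L, 8L), (8L, 9L)))
val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)

// connectedComponents() tags every vertex with the smallest vertex id in its
// component; swap, group by that id and print the members of each component
graph.connectedComponents().vertices
  .map { case (vertexId, componentId) => (componentId, vertexId) }
  .groupByKey()
  .collect()
  .foreach { case (componentId, members) =>
    println(s"$componentId -> ${members.toSeq.sorted.mkString(",")}") }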


Thanks

-- 
This message and its attachments may contain legally privileged or 
confidential information. It is intended solely for the named addressee. If 
you are not the addressee indicated in this message or responsible for 
delivery of the message to the addressee, you may not copy or deliver this 
message or its attachments to anyone. Rather, you should permanently delete 
this message and its attachments and kindly notify the sender by reply 
e-mail. Any content of this message and its attachments which does not 
relate to the official business of the sending company must be taken not to 
have been sent or endorsed by that company or any of its related entities. 
No warranty is made that the e-mail or attachments are free from computer 
virus or other defect.


Re: Can Spark Execute Hive Update/Delete operations

2015-12-01 Thread 张炜
Hello Ted and all,
We are using Hive 1.2.1 and Spark 1.5.1
I also noticed that there are other users reporting this problem.
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-spark-on-hive-td25372.html#a25486
Thanks a lot for help!

Regards,
Sai

On Wed, Dec 2, 2015 at 11:11 AM Ted Yu  wrote:

> Can you tell us the version of Spark and hive you use ?
>
> Thanks
>
> On Tue, Dec 1, 2015 at 7:08 PM, 张炜  wrote:
>
>> Dear all,
>> We have a requirement that needs to update delete records in hive. These
>> operations are available in hive now.
>>
>> But when using hiveContext in Spark, it always pops up an "not supported"
>> error.
>> Is there anyway to support update/delete operations using spark?
>>
>> Regards,
>> Sai
>>
>
>


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Marcelo Vanzin
On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu  wrote:
> I have a long running backend server where I will create a short-lived Spark
> job in response to each user request, base on the fact that by default
> multiple Spark Context cannot be created in the same JVM, looks like I have
> 2 choices
>
> 2) run my jobs in yarn-cluster mode instead yarn-client

There's nothing in yarn-client mode that prevents you from doing what
you describe. If you write some server for users to submit jobs to, it
should work whether you start the context in yarn-client or
yarn-cluster mode. It just might be harder to find out where it's
running if you do it in cluster mode.
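
For what it's worth, if the concern is just not holding a SparkContext inside
the server's own JVM, you can also launch each short-lived job as a child
spark-submit process through the SparkLauncher API. A rough sketch (jar, class
and config values are placeholders):

import org.apache.spark.launcher.SparkLauncher

// every user request spawns its own spark-submit process, so each job gets its
// own driver JVM and SparkContext instead of sharing the server's JVM
val job = new SparkLauncher()
  .setAppResource("/path/to/my-job-assembly.jar")   // placeholder
  .setMainClass("com.example.MyShortLivedJob")      // placeholder
  .setMaster("yarn-cluster")
  .setConf("spark.executor.memory", "4g")
  .launch()

val exitCode = job.waitFor()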

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Dibyendu Bhattacharya
Hi, if you use the receiver-based consumer which is available in spark-packages
( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ), it has
built-in failure recovery and can recover from any Kafka leader changes and
offset out-of-range issues.

Here is the package form github :
https://github.com/dibbhatt/kafka-spark-consumer


Dibyendu

On Wed, Dec 2, 2015 at 5:28 AM, swetha kasireddy 
wrote:

> How to avoid those Errors with receiver based approach? Suppose we are OK
> with at least once processing and use receiver based approach which uses
> ZooKeeper but not query Kafka directly, would these errors(Couldn't find
> leader offsets for
> Set([test_stream,5])))be avoided?
>
> On Tue, Dec 1, 2015 at 3:40 PM, Cody Koeninger  wrote:
>
>> KafkaRDD.scala , handleFetchErr
>>
>> On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Hi Cody,
>>>
>>> How to look at Option 2(see the following)? Which portion of the code in
>>> Spark Kafka Direct to look at to handle this issue specific to our
>>> requirements.
>>>
>>>
>>> 2.Catch that exception and somehow force things to "reset" for that
>>> partition And how would it handle the offsets already calculated in the
>>> backlog (if there is one)?
>>>
>>> On Tue, Dec 1, 2015 at 6:51 AM, Cody Koeninger 
>>> wrote:
>>>
 If you're consistently getting offset out of range exceptions, it's
 probably because messages are getting deleted before you've processed them.

 The only real way to deal with this is give kafka more retention,
 consume faster, or both.

 If you're just looking for a quick "fix" for an infrequent issue,
 option 4 is probably easiest.  I wouldn't do that automatically / silently,
 because you're losing data.

 On Mon, Nov 30, 2015 at 6:22 PM, SRK  wrote:

> Hi,
>
> So, our Streaming Job fails with the following errors. If you see the
> errors
> below, they are all related to Kafka losing offsets and
> OffsetOutOfRangeException.
>
> What are the options we have other than fixing Kafka? We would like to
> do
> something like the following. How can we achieve 1 and 2 with Spark
> Kafka
> Direct?
>
> 1.Need to see a way to skip some offsets if they are not available
> after the
> max retries are reached..in that case there might be data loss.
>
> 2.Catch that exception and somehow force things to "reset" for that
> partition And how would it handle the offsets already calculated in the
> backlog (if there is one)?
>
> 3.Track the offsets separately, restart the job by providing the
> offsets.
>
> 4.Or a straightforward approach would be to monitor the log for this
> error,
> and if it occurs more than X times, kill the job, remove the checkpoint
> directory, and restart.
>
> ERROR DirectKafkaInputDStream:
> ArrayBuffer(kafka.common.UnknownException,
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([test_stream,5]))
>
>
>
> java.lang.ClassNotFoundException:
> kafka.common.NotLeaderForPartitionException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
>
> java.util.concurrent.RejectedExecutionException: Task
>
> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
> rejected from java.util.concurrent.ThreadPoolExecutor@543258e0
> [Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
> 12112]
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 10
> in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in
> stage
> 52.0 (TID 255, 172.16.97.97): UnknownReason
>
> Exception in thread "streaming-job-executor-0" java.lang.Error:
> java.lang.InterruptedException
>
> Caused by: java.lang.InterruptedException
>
> java.lang.ClassNotFoundException:
> kafka.common.OffsetOutOfRangeException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 7 in
> stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage
> 33.0
> (TID 283, 172.16.97.103): UnknownReason
>
> java.lang.ClassNotFoundException:
> kafka.common.OffsetOutOfRangeException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
> java.lang.ClassNotFoundException:
> kafka.common.OffsetOutOfRangeException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
>
> --
> View this message in context:
> 

Retrieving the PCA parameters in pyspark

2015-12-01 Thread Rohit Girdhar
Hi

I'm using PCA through the python interface for spark, as per the
instructions on this page:
https://spark.apache.org/docs/1.5.1/ml-features.html#pca

It works fine for learning the parameters and transforming the data.
However, I'm unable to find a way to retrieve the learnt PCA parameters. I
tried using model.params, but that returns an empty list. Is there some way
to retrieve these parameters (maybe through the py4j interface)?

Thanks
Rohit


Re: Spark Expand Cluster

2015-12-01 Thread Alexander Pivovarov
Try running spark-shell with the correct number of executors.

e.g. for a 10-box cluster running on r3.2xlarge (61 GB RAM, 8 cores) you can use
the following

spark-shell \
--num-executors 20 \
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 4


you might also want to set spark.yarn.executor.memoryOverhead to 2662
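
Alternatively, if you want the number of executors to follow the cluster size
instead of being fixed with --num-executors, you can try dynamic allocation
(it needs the external shuffle service enabled on the node managers); for
example something like:

spark-shell \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=20 \
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 4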





On Tue, Nov 24, 2015 at 2:07 AM, Dinesh Ranganathan <
dineshranganat...@gmail.com> wrote:

> Thanks Christopher, I will try that.
>
> Dan
>
> On 20 November 2015 at 21:41, Bozeman, Christopher 
> wrote:
>
>> Dan,
>>
>>
>>
>> Even though you may be adding more nodes to the cluster, the Spark
>> application has to be requesting additional executors in order to thus use
>> the added resources.  Or the Spark application can be using Dynamic
>> Resource Allocation (
>> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation)
>> [which may use the resources based on application need and availability].
>> For example, in EMR release 4.x (
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#spark-dynamic-allocation)
>> you can request Spark Dynamic Resource Allocation as the default
>> configuration at cluster creation.
>>
>>
>>
>> Best regards,
>>
>> Christopher
>>
>>
>>
>>
>>
>> *From:* Dinesh Ranganathan [mailto:dineshranganat...@gmail.com]
>> *Sent:* Monday, November 16, 2015 4:57 AM
>> *To:* Sabarish Sasidharan
>> *Cc:* user
>> *Subject:* Re: Spark Expand Cluster
>>
>>
>>
>> Hi Sab,
>>
>>
>>
>> I did not specify number of executors when I submitted the spark
>> application. I was in the impression spark looks at the cluster and figures
>> out the number of executors it can use based on the cluster size
>> automatically, is this what you call dynamic allocation?. I am spark
>> newbie, so apologies if I am missing the obvious. While the application was
>> running I added more core nodes by resizing my EMR instance and I can see
>> the new nodes on the resource manager but my running application did not
>> pick up those machines I've just added.   Let me know If i am missing a
>> step here.
>>
>>
>>
>> Thanks,
>>
>> Dan
>>
>>
>>
>> On 16 November 2015 at 12:38, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>> Spark will use the number of executors you specify in spark-submit. Are
>> you saying that Spark is not able to use more executors after you modify it
>> in spark-submit? Are you using dynamic allocation?
>>
>>
>>
>> Regards
>>
>> Sab
>>
>>
>>
>> On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
>> dineshranganat...@gmail.com> wrote:
>>
>> I have my Spark application deployed on AWS EMR on yarn cluster mode.
>> When I
>> increase the capacity of my cluster by adding more Core instances on AWS,
>> I
>> don't see Spark picking up the new instances dynamically. Is there
>> anything
>> I can do to tell Spark to pick up the newly added boxes??
>>
>> Dan
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>>
>>
>> --
>>
>>
>>
>> Architect - Big Data
>>
>> Ph: +91 99805 99458
>>
>>
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>>
>> +++
>>
>>
>>
>>
>>
>> --
>>
>> Dinesh Ranganathan
>>
>
>
>
> --
> Dinesh Ranganathan
>


Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Jeff Zhang
I don't think there's an API for that, but I think it is reasonable and helpful
for ETL.

As a workaround you can first register your DataFrame as a temp table, and
use SQL to insert into the static partition.
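
For example, something along these lines with a HiveContext (the table, columns
and partition value below just follow the example in your mail and are placeholders):

// df holds the computed (col_a, col_b) rows for one date
df.registerTempTable("staging")

sqlContext.sql(
  "INSERT OVERWRITE TABLE my_table PARTITION (date = '2015-12-01') " +
  "SELECT col_a, col_b FROM staging")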

On Wed, Dec 2, 2015 at 10:50 AM, Isabelle Phan  wrote:

> Hello,
>
> Is there any API to insert data into a single partition of a table?
>
> Let's say I have a table with 2 columns (col_a, col_b) and a partition by
> date.
> After doing some computation for a specific date, I have a DataFrame with
> 2 columns (col_a, col_b) which I would like to insert into a specific date
> partition. What is the best way to achieve this?
>
> It seems that if I add a date column to my DataFrame, and turn on dynamic
> partitioning, I can do:
> df.write.partitionBy("date").insertInto("my_table")
> But it seems overkill to use dynamic partitioning function for such a case.
>
>
> Thanks for any pointers!
>
> Isabelle
>
>
>


-- 
Best Regards

Jeff Zhang


Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi
you can try:

if your table is under location "/test/table/" on HDFS
and has partitions:

 "/test/table/dt=2012"
 "/test/table/dt=2013"

df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")



> On Dec 2, 2015, at 10:50 AM, Isabelle Phan  wrote:
> 
> df.write.partitionBy("date").insertInto("my_table")



Spark Streaming - History UI

2015-12-01 Thread patcharee

Hi,

On my history server UI, I cannot see the "streaming" tab for any streaming 
jobs. I am using version 1.5.1. Any ideas?


Thanks,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Increasing memory usage on batch job (pyspark)

2015-12-01 Thread Aaron Jackson
Greetings,

I am processing a "batch" of files and have structured an iterative process
around them. Each batch is processed by first loading the data with
spark-csv, performing some minor transformations and then writing back out
as parquet.  Absolutely no caching or shuffle should occur with anything in
this process.

I watch memory utilization on each executor and I notice a steady increase
in memory with each iteration that completes.  Eventually, we reach the
memory limit set for the executor and the process begins to slowly degrade
and fail.

I'm really unclear about what I am doing that could possibly be causing the
executors to hold on to state between iterations.  Again, I was careful to
make sure there was no caching that occurred.  I've done most of my testing
to date in python, though I will port it to scala to see if the behavior is
potentially isolated to the runtime.

Spark: 1.5.2

~~ Ajaxx


Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Anfernee Xu
Thanks Marcelo,

But I have a single server (JVM) that is creating the SparkContext; are you
saying Spark supports multiple SparkContexts in the same JVM? Could you
please clarify this?

Thanks

Anfernee

On Tue, Dec 1, 2015 at 8:14 PM, Marcelo Vanzin  wrote:

> On Tue, Dec 1, 2015 at 3:32 PM, Anfernee Xu  wrote:
> > I have a long running backend server where I will create a short-lived
> Spark
> > job in response to each user request, base on the fact that by default
> > multiple Spark Context cannot be created in the same JVM, looks like I
> have
> > 2 choices
> >
> > 2) run my jobs in yarn-cluster mode instead yarn-client
>
> There's nothing in yarn-client mode that prevents you from doing what
> you describe. If you write some server for users to submit jobs to, it
> should work whether you start the context in yarn-client or
> yarn-cluster mode. It just might be harder to find out where it's
> running if you do it in cluster mode.
>
> --
> Marcelo
>



-- 
--Anfernee


Re: Grid search with Random Forest

2015-12-01 Thread Ndjido Ardo BAR
Thanks for the clarification. Gonna test that and give you feedbacks.

Ndjido
On Tue, 1 Dec 2015 at 19:29, Joseph Bradley  wrote:

> You can do grid search if you set the evaluator to a
> MulticlassClassificationEvaluator, which expects a prediction column, not a
> rawPrediction column.  There's a JIRA for making
> BinaryClassificationEvaluator accept prediction instead of rawPrediction.
> Joseph
>
> On Tue, Dec 1, 2015 at 5:10 AM, Benjamin Fradet  > wrote:
>
>> Someone correct me if I'm wrong but no there isn't one that I am aware of.
>>
>> Unless someone is willing to explain how to obtain the raw prediction
>> column with the GBTClassifier. In this case I'd be happy to work on a PR.
>> On 1 Dec 2015 8:43 a.m., "Ndjido Ardo BAR"  wrote:
>>
>>> Hi Benjamin,
>>>
>>> Thanks, the documentation you sent is clear.
>>> Is there any other way to perform a Grid Search with GBT?
>>>
>>>
>>> Ndjido
>>> On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet 
>>> wrote:
>>>
 Hi Ndjido,

 This is because GBTClassifier doesn't yet have a rawPredictionCol like
 the. RandomForestClassifier has.
 Cf:
 http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
 On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR"  wrote:

> Hi Joseph,
>
> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting
> a "rawPredictionCol field does not exist exception" on Spark 1.5.2 for
> Gradient Boosting Trees classifier.
>
>
> Ardo
> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley 
> wrote:
>
>> It should work with 1.5+.
>>
>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar 
>> wrote:
>>
>>>
>>> Hi folks,
>>>
>>> Does anyone know whether the Grid Search capability is enabled since
>>> the issue spark-9011 of version 1.4.0 ? I'm getting the 
>>> "rawPredictionCol
>>> column doesn't exist" when trying to perform a grid search with Spark 
>>> 1.4.0.
>>>
>>> Cheers,
>>> Ardo
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Getting all files of a table

2015-12-01 Thread Krzysztof Zarzycki
Hi there,
Do you know how easily I can get a list of all files of a Hive table?

What I want to achieve is to get all the files that are underneath a parquet
table and, using the sparksql-protobuf[1] library (really handy library!) and its
helper class ProtoParquetRDD:

val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])

access the underlying parquet files as normal protocol buffers. But I don't
know how to get them. When I pointed the call above at one file by hand it
worked well.
The parquet table was created with the same library and its implicit
hiveContext extension createDataFrame, which creates a DataFrame based on a
Protocol buffer class.

(The reverse read operation is needed to support legacy code where, after
converting protocol buffers to parquet, I still want some code to access
parquet as normal protocol buffers.)

Maybe someone will have other way to get an Rdd of protocol buffers from
Parquet-stored table.

[1] https://github.com/saurfang/sparksql-protobuf

Thanks!
Krzysztof


Re: Getting all files of a table

2015-12-01 Thread Michael Armbrust
sqlContext.table("...").inputFiles

(this is best effort, but should work for hive tables).
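
For example, to feed those files into the ProtoParquetRDD from your mail - a
sketch that assumes the path argument accepts a comma-separated list of files,
as Hadoop input paths usually do:

val files = sqlContext.table("my_parquet_table").inputFiles   // placeholder table name
val protobufsRdd = new ProtoParquetRDD(sc, files.mkString(","), classOf[MyProto])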

Michael

On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki 
wrote:

> Hi there,
> Do you know how easily I can get a list of all files of a Hive table?
>
> What I want to achieve is to get all files that are underneath parquet
> table and using sparksql-protobuf[1] library(really handy library!) and its
> helper class ProtoParquetRDD:
>
> val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])
>
> Access the underlying parquet files as normal protocol buffers. But I
> don't know how to get them. I pointed the call above to one file by hand it
> worked well.
> The parquet table was created based on the same library and it's implicit
> hiveContext extension createDataFrame, which creates a DataFrame based on
> Protocol buffer class.
>
> (The revert read operation is needed to support legacy code, where after
> converting protocol buffers to parquet, I still want some code to access
> parquet as normal protocol buffers).
>
> Maybe someone will have other way to get an Rdd of protocol buffers from
> Parquet-stored table.
>
> [1] https://github.com/saurfang/sparksql-protobuf
>
> Thanks!
> Krzysztof
>
>
>
>


Re: Help with type check

2015-12-01 Thread Eyal Sharon
Great, That works perfect !!
Also tnx for the links - very helpful

On Tue, Dec 1, 2015 at 12:13 AM, Jakob Odersky  wrote:

> Hi Eyal,
>
> what you're seeing is not a Spark issue, it is related to boxed types.
>
> I assume 'b' in your code is some kind of java buffer, where b.getDouble()
> returns an instance of java.lang.Double and not a scala.Double. Hence
> muCouch is an Array[java.lang.Double], an array containing boxed doubles.
>
> To fix your problem, change 'yield b.getDouble(i)' to 'yield
> b.getDouble(i).doubleValue'
>
> You might want to have a look at these too:
> -
> http://stackoverflow.com/questions/23821576/efficient-conversion-of-java-util-listjava-lang-double-to-scala-listdouble
> - https://docs.oracle.com/javase/7/docs/api/java/lang/Double.html
> - http://www.scala-lang.org/api/current/index.html#scala.Double
>
> On 30 November 2015 at 10:13, Eyal Sharon  wrote:
>
>> Hi ,
>>
>> I have problem with inferring what are the types bug here
>>
>> I have this code fragment . it parse Json to Array[Double]
>>
>>
>>
>>
>>
>>
>> *val muCouch = {  val e = input.filter( _.id=="mu")(0).content()  val b  = 
>> e.getArray("feature_mean")  for (i <- 0 to e.getInt("features") ) yield 
>> b.getDouble(i)}.toArray*
>>
>> Now the problem is when I want to create a dense vector  :
>>
>> *new DenseVector(muCouch)*
>>
>>
>> I get the following error :
>>
>>
>> *Error:(111, 21) type mismatch;
>>  found   : Array[java.lang.Double]
>>  required: Array[scala.Double] *
>>
>>
>> Now , I probably get a workaround for that , but I want to get a deeper 
>> understanding  on why it occurs
>>
>> p.s - I do use collection.JavaConversions._
>>
>> Thanks !
>>
>>
>>
>>
>> *This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they are
>> addressed. Please note that any disclosure, copying or distribution of the
>> content of this information is strictly forbidden. If you have received
>> this email message in error, please destroy it immediately and notify its
>> sender.*
>>
>
>

-- 


*This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are 
addressed. Please note that any disclosure, copying or distribution of the 
content of this information is strictly forbidden. If you have received 
this email message in error, please destroy it immediately and notify its 
sender.*


Re: Cant start master on windows 7

2015-12-01 Thread Jacek Laskowski
Neat! It should be part of the docs.

Do you also happen to know why people are seeing the weird networking
problems with unresolvable network addresses? It looks like the issue is
"polluting" Spark mailing list/SO quite often. Why don't you run into
the issue? Did you change anything network-related? Thanks for any
help!

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Mon, Nov 30, 2015 at 10:44 PM, Tim Barthram  wrote:
> Hi Jacek,
>
> To run a spark master on my windows box, I've created a .bat file with 
> contents something like:
>
> .\bin\spark-class.cmd org.apache.spark.deploy.master.Master --host 
>
>
> For the worker:
>
> .\bin\spark-class.cmd org.apache.spark.deploy.worker.Worker 
> spark://:7077
>
>
> To wrap these in services, I've used yasw or nssm.
>
> Thanks,
> Tim
>
>
>
>
> -Original Message-
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: Tuesday, 1 December 2015 4:18 AM
> To: Shuo Wang
> Cc: user
> Subject: Re: Cant start master on windows 7
>
> On Fri, Nov 27, 2015 at 4:27 PM, Shuo Wang  wrote:
>
>> I am trying to use the start-master.sh script on windows 7.
>
> From http://spark.apache.org/docs/latest/spark-standalone.html:
>
> "Note: The launch scripts do not currently support Windows. To run a
> Spark cluster on Windows, start the master and workers by hand."
>
> Can you start the command by hand? Just copy and paste the command
> from the logs. Mind the spaces!
>
> Jacek
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
> _
>
> The information transmitted in this message and its attachments (if any) is 
> intended
> only for the person or entity to which it is addressed.
> The message may contain confidential and/or privileged material. Any review,
> retransmission, dissemination or other use of, or taking of any action in 
> reliance
> upon this information, by persons or entities other than the intended 
> recipient is
> prohibited.
>
> If you have received this in error, please contact the sender and delete this 
> e-mail
> and associated material from any computer.
>
> The intended recipient of this e-mail may only use, reproduce, disclose or 
> distribute
> the information contained in this e-mail and any attached files, with the 
> permission
> of the sender.
>
> This message has been scanned for viruses.
> _

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: merge 3 different types of RDDs in one

2015-12-01 Thread Shams ul Haque
Hi Jacek,

Thanks for the suggestion, I am going to try union.
And what is your opinion on the 2nd question?


Thanks
Shams

On Tue, Dec 1, 2015 at 3:23 PM, Jacek Laskowski  wrote:

> Hi,
>
> Never done it before, but just yesterday I found out about
> SparkContext.union method that could help in your case.
>
> def union[T](rdds: Seq[RDD[T]])(implicit arg0: ClassTag[T]): RDD[T]
>
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Tue, Dec 1, 2015 at 10:47 AM, Shams ul Haque 
> wrote:
> > Hi All,
> >
> > I have made 3 RDDs of 3 different dataset, all RDDs are grouped by
> > CustomerID in which 2 RDDs have value of Iterable type and one has signle
> > bean. All RDDs have id of Long type as CustomerId. Below are the model
> for 3
> > RDDs:
> > JavaPairRDD
> > JavaPairRDD
> > JavaPairRDD
> >
> > Now, i have to merge all these 3 RDDs as signle one so that i can
> generate
> > excel report as per each customer by using data in 3 RDDs.
> > As i tried to using Join Transformation but it needs RDDs of same type
> and
> > it works for two RDDs.
> > So my questions is,
> > 1. is there any way to done my merging task efficiently, so that i can
> get
> > all 3 dataset by CustomerId?
> > 2. If i merge 1st two using Join Transformation, then do i need to run
> > groupByKey() before Join so that all data related to single customer
> will be
> > on one node?
> >
> >
> > Thanks
> > Shams
>


Re: merge 3 different types of RDDs in one

2015-12-01 Thread Praveen Chundi

cogroup could be useful to you, since all three are PairRDD's.

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
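
A rough sketch of what that could look like with the Scala API (the bean and RDD
names below are made up, keyed by customer id):

case class Order(amount: Double)
case class Payment(amount: Double)
case class Profile(name: String)

val ordersRdd   = sc.parallelize(Seq((1L, Order(10.0)), (1L, Order(5.0)), (2L, Order(7.0))))
val paymentsRdd = sc.parallelize(Seq((1L, Payment(15.0)), (2L, Payment(7.0))))
val profileRdd  = sc.parallelize(Seq((1L, Profile("alice")), (2L, Profile("bob"))))

// one cogroup keyed by customer id brings all three datasets together:
// RDD[(Long, (Iterable[Order], Iterable[Payment], Iterable[Profile]))]
val byCustomer = ordersRdd.cogroup(paymentsRdd, profileRdd)

byCustomer.collect().foreach { case (customerId, (orders, payments, profiles)) =>
  println(s"$customerId: ${orders.size} orders, ${payments.size} payments, ${profiles.size} profiles")
}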

Best Regards,
Praveen


On 01.12.2015 10:47, Shams ul Haque wrote:

Hi All,

I have made 3 RDDs of 3 different dataset, all RDDs are grouped by 
CustomerID in which 2 RDDs have value of Iterable type and one has 
signle bean. All RDDs have id of Long type as CustomerId. Below are 
the model for 3 RDDs:

JavaPairRDD
JavaPairRDD
JavaPairRDD

Now, i have to merge all these 3 RDDs as signle one so that i can 
generate excel report as per each customer by using data in 3 RDDs.
As i tried to using Join Transformation but it needs RDDs of same type 
and it works for two RDDs.

So my questions is,
1. is there any way to done my merging task efficiently, so that i can 
get all 3 dataset by CustomerId?
2. If i merge 1st two using Join Transformation, then do i need to run 
groupByKey() before Join so that all data related to single customer 
will be on one node?



Thanks
Shams



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sonal Goyal
I think you should be able to join different  rdds with same key. Have you
tried that?
On Dec 1, 2015 3:30 PM, "Praveen Chundi"  wrote:

> cogroup could be useful to you, since all three are PairRDD's.
>
>
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
>
> Best Regards,
> Praveen
>
>
> On 01.12.2015 10:47, Shams ul Haque wrote:
>
>> Hi All,
>>
>> I have made 3 RDDs of 3 different dataset, all RDDs are grouped by
>> CustomerID in which 2 RDDs have value of Iterable type and one has signle
>> bean. All RDDs have id of Long type as CustomerId. Below are the model for
>> 3 RDDs:
>> JavaPairRDD
>> JavaPairRDD
>> JavaPairRDD
>>
>> Now, i have to merge all these 3 RDDs as signle one so that i can
>> generate excel report as per each customer by using data in 3 RDDs.
>> As i tried to using Join Transformation but it needs RDDs of same type
>> and it works for two RDDs.
>> So my questions is,
>> 1. is there any way to done my merging task efficiently, so that i can
>> get all 3 dataset by CustomerId?
>> 2. If i merge 1st two using Join Transformation, then do i need to run
>> groupByKey() before Join so that all data related to single customer will
>> be on one node?
>>
>>
>> Thanks
>> Shams
>>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sushrut Ikhar
Hi,
I have used union myself in a similar case, and applied reduceByKey on it.
Union + reduceByKey will do the work of a join, but you will first have to use a map
so that all the values are of the same datatype.

Regards,

Sushrut Ikhar
https://about.me/sushrutikhar



On Tue, Dec 1, 2015 at 3:34 PM, Sonal Goyal  wrote:

> I think you should be able to join different  rdds with same key. Have you
> tried that?
> On Dec 1, 2015 3:30 PM, "Praveen Chundi"  wrote:
>
>> cogroup could be useful to you, since all three are PairRDD's.
>>
>>
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
>>
>> Best Regards,
>> Praveen
>>
>>
>> On 01.12.2015 10:47, Shams ul Haque wrote:
>>
>>> Hi All,
>>>
>>> I have made 3 RDDs of 3 different dataset, all RDDs are grouped by
>>> CustomerID in which 2 RDDs have value of Iterable type and one has signle
>>> bean. All RDDs have id of Long type as CustomerId. Below are the model for
>>> 3 RDDs:
>>> JavaPairRDD
>>> JavaPairRDD
>>> JavaPairRDD
>>>
>>> Now, i have to merge all these 3 RDDs as signle one so that i can
>>> generate excel report as per each customer by using data in 3 RDDs.
>>> As i tried to using Join Transformation but it needs RDDs of same type
>>> and it works for two RDDs.
>>> So my questions is,
>>> 1. is there any way to done my merging task efficiently, so that i can
>>> get all 3 dataset by CustomerId?
>>> 2. If i merge 1st two using Join Transformation, then do i need to run
>>> groupByKey() before Join so that all data related to single customer will
>>> be on one node?
>>>
>>>
>>> Thanks
>>> Shams
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
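
A rough sketch of the union route in spark-shell; the Merged wrapper type and
the sample data are placeholders, not anything from the original code:

  case class Merged(profiles: Seq[String], orders: Seq[Double], payments: Seq[Double])

  val profiles = sc.parallelize(Seq(1L -> "alice", 2L -> "bob"))
  val orders   = sc.parallelize(Seq(1L -> 10.0, 1L -> 20.0))
  val payments = sc.parallelize(Seq(2L -> 5.0))

  // map every RDD to the same value type before the union
  val all = profiles.map { case (id, p) => (id, Merged(Seq(p), Nil, Nil)) } union
            orders.map   { case (id, o) => (id, Merged(Nil, Seq(o), Nil)) } union
            payments.map { case (id, x) => (id, Merged(Nil, Nil, Seq(x))) }

  // reduceByKey collapses the per-customer pieces into a single record
  val merged = all.reduceByKey { (a, b) =>
    Merged(a.profiles ++ b.profiles, a.orders ++ b.orders, a.payments ++ b.payments)
  }

The trade-off versus cogroup is an extra map over every record, but reduceByKey
can combine values on the map side before anything is shuffled.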
>>


Re: dfs.blocksize is not applicable to some cases

2015-12-01 Thread Jung
I have additional information. The second case works normally if I set dfs.blocksize 
in hdfs-site.xml to 512MB and restart the namenode and all datanodes.

  231.0 M  
/user/hive/warehouse/partition_test3/part-r-0-d2e4ee9e-0a5f-4ee1-b511-88848a7a92d4.gz.parquet
  
/user/hive/warehouse/partition_test3/part-r-0-d2e4ee9e-0a5f-4ee1-b511-88848a7a92d4.gz.parquet
 242202275 bytes, 1 block(s):  OK

It seems dfs.blocksize from sc.hadoopConfiguration gets ignored somewhere 
when the parent RDD comes from a managed table or a Parquet source.

-Original Message-
From: "Jung" 
To: "Ted Yu"; ; 
Cc: 
Sent: 2015-12-01 (화) 10:22:25
Subject: Re: dfs.blocksize is not applicable to some cases
 
Yes, I can reproduce it in Spark 1.5.2.
These are the results.

1. first case(1block)
  221.1 M  
/user/hive/warehouse/partition_test/part-r-0-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet
  221.1 M  
/user/hive/warehouse/partition_test/part-r-1-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet
  221.1 M  
/user/hive/warehouse/partition_test/part-r-2-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet

  
/user/hive/warehouse/partition_test/part-r-0-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet
 231863863 bytes, 1 block(s):  OK

2. second case(2blocks)
   231.0 M  
/user/hive/warehouse/partition_test2/part-r-0-b7486a52-cfb9-4db0-8d94-377c039026ef.gz.parquet
  
  
/user/hive/warehouse/partition_test2/part-r-0-b7486a52-cfb9-4db0-8d94-377c039026ef.gz.parquet
 242201812 bytes, 2 block(s):  OK

In terms of PARQUET-166, I think it only discusses row group performance. 
Should I set dfs.blocksize to a little bit more than parquet.block.size? 

Thanks

-Original Message-
From: "Ted Yu" 
To: "Jung"; 
Cc: "user"; 
Sent: 2015-12-01 (화) 03:09:58
Subject: Re: dfs.blocksize is not applicable to some cases
 
I am not expert in Parquet. Looking at PARQUET-166, it seems that 
parquet.block.size should be lower than dfs.blocksize Have you tried Spark 
1.5.2 to see if the problem persists ? Cheers
On Mon, Nov 30, 2015 at 1:55 AM, Jung  wrote:
Hello,
I use Spark 1.4.1 and Hadoop 2.2.0.
It may be a stupid question, but I cannot understand why the Hadoop option 
"dfs.blocksize" sometimes doesn't affect the number of blocks.
When I run the script below,

  val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, hadoop default is 128MB
  sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
  sc.hadoopConfiguration.setInt("dfs.blocksize",BLOCK_SIZE)
  sc.parallelize(1 to 5, 
24).repartition(3).toDF.saveAsTable("partition_test")

it creates 3 files like this.

  221.1 M  /user/hive/warehouse/partition_test/part-r-1.gz.parquet
  221.1 M  /user/hive/warehouse/partition_test/part-r-2.gz.parquet
  221.1 M  /user/hive/warehouse/partition_test/part-r-3.gz.parquet

To check how many blocks in a file, I enter the command "hdfs fsck 
/user/hive/warehouse/partition_test/part-r-1.gz.parquet -files -blocks".

  Total blocks (validated):  1 (avg. block size 231864402 B)

It is normal case because maximum blocksize change from 128MB to 512MB.
In the real world, I have a bunch of files.

  14.4 M  /user/hive/warehouse/data_1g/part-r-1.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-2.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-3.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-4.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-5.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-6.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-7.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-8.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-9.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00010.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00011.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00012.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00013.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00014.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00015.gz.parquet
  14.4 M  /user/hive/warehouse/data_1g/part-r-00016.gz.parquet

Each file consists of 1block (avg. block size 15141395 B) and I run the almost 
same code as first.

  val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, hadoop default is 128MB
  sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
  sc.hadoopConfiguration.setInt("dfs.blocksize",BLOCK_SIZE)
  sqlContext.table("data_1g").repartition(1).saveAsTable("partition_test2")

It creates one file.

 231.0 M  /user/hive/warehouse/partition_test2/part-r-1.gz.parquet

But it consists of 2 blocks. It seems dfs.blocksize is not applicable.

  /user/hive/warehouse/partition_test2/part-r-1.gz.parquet 242202143 bytes, 
2 block(s):  OK
  0. BP-2098986396-192.168.100.1-1389779750403:blk_1080124727_6385839 

Re: Failing to execute Pregel shortest path on 22k nodes

2015-12-01 Thread Robineast
1. The for loop is executed in your driver program so will send each Pregel
request serially to be executed on the cluster
2. Whilst caching/persisting may improve the runtime it shouldn't affect the
memory bounds - if you ask to cache more than is available then cached RDDs
will be dropped out of the cache. How are you running the program? via
spark-submit - if so what parameters are you using?




-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Failing-to-execute-Pregel-shortest-path-on-22k-nodes-tp25528p25531.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark streaming job hangs

2015-12-01 Thread Paul Leclercq
You might not have enough cores to process data from Kafka


> When running a Spark Streaming program locally, do not use “local” or
> “local[1]” as the master URL. Either of these means that only one thread
> will be used for running tasks locally. If you are using a input DStream
> based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
> thread will be used to run the receiver, leaving no thread for processing
> the received data. *Hence, when running locally, always use “local[n]” as
> the master URL, ​*where n > number of receivers to run (see Spark
> Properties for information on how to set the master).*


 
https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers


2015-12-01 7:13 GMT+01:00 Cassa L :

> Hi,
>  I am reading data from Kafka into spark. It runs fine for sometime but
> then hangs forever with following output. I don't see and errors in logs.
> How do I debug this?
>
> 2015-12-01 06:04:30,697 [dag-scheduler-event-loop] INFO
> (Logging.scala:59) - Adding task set 19.0 with 4 tasks
> 2015-12-01 06:04:30,872 [pool-13-thread-1] INFO  (Logging.scala:59) -
> Disconnected from Cassandra cluster: APG DEV Cluster
> 2015-12-01 06:04:35,060 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 1448949875000 ms
> 2015-12-01 06:04:40,054 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 144894988 ms
> 2015-12-01 06:04:45,034 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 1448949885000 ms
> 2015-12-01 06:04:50,100 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 144894989 ms
> 2015-12-01 06:04:55,064 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 1448949895000 ms
> 2015-12-01 06:05:00,125 [JobGenerator] INFO  (Logging.scala:59) - Added
> jobs for time 144894990 ms
>
>
> Thanks
> LCassa
>



-- 

Paul Leclercq | Data engineer


 paul.lecle...@tabmo.io  |  http://www.tabmo.fr/
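
A tiny sketch of that point; the app name and batch interval below are
placeholders, and the only important part is that the local master has more
threads than receivers:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // use at least 2 local threads: one for the Kafka receiver, one left for processing
  val conf = new SparkConf().setAppName("kafka-test").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(5))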


Re: merge 3 different types of RDDs in one

2015-12-01 Thread Jacek Laskowski
On Tue, Dec 1, 2015 at 10:57 AM, Shams ul Haque  wrote:

> Thanks for the suggestion, i am going to try union.

...and please report your findings back.

> And what is your opinion on 2nd question.

Dunno. If you find a solution, let us know.

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



New to Spark

2015-12-01 Thread Ashok Kumar

  Hi,
I am new to Spark.
I am trying to use spark-sql with SPARK CREATED and HIVE CREATED tables.
I have successfully configured Spark to use the Hive metastore.
In spark-sql I can see the DDL for Hive tables. However, when I do select 
count(1) from HIVE_TABLE it always returns zero rows.
If I create a table in Spark as create table SPARK_TABLE as select * from 
HIVE_TABLE, the table schema is created but there is no data.
I can then use Hive to do an INSERT ... SELECT from HIVE_TABLE into SPARK_TABLE. That 
works.
I can then use spark-sql to query the table.
My questions:
   
   - Is it correct that spark-sql only sees data in Spark-created tables but 
not any data in Hive tables?
   - How can I make Spark read data from existing Hive tables?


Thanks
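
One common way to read an existing Hive table from a Spark program is through a
HiveContext (a sketch, assuming a Spark build with Hive support, an existing
SparkContext sc, and a hive-site.xml visible to Spark so it points at the same
metastore and warehouse; hive_table is a placeholder name):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)

  // runs against the existing Hive table through the shared metastore
  val counts = hiveContext.sql("SELECT count(1) FROM hive_table")
  counts.show()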

   

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
I don't see 2.4.0 release under:
http://mvnrepository.com/artifact/com.typesafe.akka/akka-remote_2.10

Probably that was the cause for the 'Could not find artifact' error.

On Tue, Dec 1, 2015 at 7:03 AM, Boavida, Rodrigo  wrote:

> Hi Ted,
>
> Thanks for getting back to me and for the suggestion.
>
> Running a 'mvn dependency:tree' I get the following:
>
> [ERROR] Failed to execute goal on project spark-core_2.11: Could not
> resolve dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.5.2: The following artifacts could
> not be resolved: com.typesafe.akka:akka-remote_2.10:jar:2.4.0,
> com.typesafe.akka:akka-slf4j_2.10:jar:2.4.0,
> com.typesafe.akka:akka-testkit_2.10:jar:2.4.0: Could not find artifact
> com.typesafe.akka:akka-remote_2.10:jar:2.4.0 in central (
> https://repo1.maven.org/maven2) -> [Help 1]
>
> So it seems somehow it's still pulling some 2.10 dependencies. Do you
> think this could be the cause for the observed problem?
>
> tnks,
> Rod
>
> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: 01 December 2015 14:13
> To: Boavida, Rodrigo 
> Cc: user@spark.apache.org
> Subject: Re: Scala 2.11 and Akka 2.4.0
>
> Have you run 'mvn dependency:tree' and examined the output ?
>
> There should be some hint.
>
> Cheers
>
> > On Dec 1, 2015, at 5:32 AM, RodrigoB  wrote:
> >
> > Hi,
> >
> > I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.
> > I've changed the main pom.xml files to corresponding akka version and
> > am getting the following exception when starting the master on
> standalone:
> >
> > Exception Details:
> >  Location:
> >akka/dispatch/Mailbox.processAllSystemMessages()V @152: getstatic
> >  Reason:
> >Type top (current frame, locals[9]) is not assignable to
> > 'akka/dispatch/sysmsg/SystemMessage' (stack map, locals[9])  Current
> > Frame:
> >bci: @131
> >flags: { }
> >locals: { 'akka/dispatch/Mailbox',
> > 'java/lang/InterruptedException',
> > 'akka/dispatch/sysmsg/SystemMessage', top, 'akka/dispatch/Mailbox',
> 'java/lang/Throwable', 'java/lang/Throwable' }
> >stack: { integer }
> >  Stackmap Frame:
> >bci: @152
> >flags: { }
> >locals: { 'akka/dispatch/Mailbox',
> > 'java/lang/InterruptedException',
> > 'akka/dispatch/sysmsg/SystemMessage', top, 'akka/dispatch/Mailbox',
> > 'java/lang/Throwable', 'java/lang/Throwable', top, top,
> 'akka/dispatch/sysmsg/SystemMessage' }
> >stack: { }
> >  Bytecode:
> >0x000: 014c 2ab2 0132 b601 35b6 0139 4db2 013e
> >0x010: 2cb6 0142 9900 522a b600 c69a 004b 2c4e
> >0x020: b201 3e2c b601 454d 2db9 0148 0100 2ab6
> >0x030: 0052 2db6 014b b801 0999 000e bb00 e759
> >0x040: 1301 4db7 010f 4cb2 013e 2cb6 0150 99ff
> >0x050: bf2a b600 c69a ffb8 2ab2 0132 b601 35b6
> >0x060: 0139 4da7 ffaa 2ab6 0052 b600 56b6 0154
> >0x070: b601 5a3a 04a7 0091 3a05 1905 3a06 1906
> >0x080: c100 e799 0015 1906 c000 e73a 0719 074c
> >0x090: b200 f63a 08a7 0071 b201 5f19 06b6 0163
> >0x0a0: 3a0a 190a b601 6899 0006 1905 bf19 0ab6
> >0x0b0: 016c c000 df3a 0b2a b600 52b6 0170 b601
> >0x0c0: 76bb 000f 5919 0b2a b600 52b6 017a b601
> >0x0d0: 80b6 0186 2ab6 018a bb01 8c59 b701 8e13
> >0x0e0: 0190 b601 9419 09b6 0194 1301 96b6 0194
> >0x0f0: 190b b601 99b6 0194 b601 9ab7 019d b601
> >0x100: a3b2 00f6 3a08 b201 3e2c b601 4299 0026
> >0x110: 2c3a 09b2 013e 2cb6 0145 4d19 09b9 0148
> >0x120: 0100 1904 2ab6 0052 b601 7a19 09b6 01a7
> >0x130: a7ff d62b c600 09b8 0109 572b bfb1  Exception Handler
> > Table:
> >bci [290, 307] => handler: 120
> >  Stackmap Table:
> >append_frame(@13,Object[#231],Object[#177])
> >append_frame(@71,Object[#177])
> >chop_frame(@102,1)
> >
> > full_frame(@120,{Object[#2],Object[#231],Object[#177],Top,Object[#2],O
> > bject[#177]},{Object[#223]})
> >
> >
> full_frame(@152,{Object[#2],Object[#231],Object[#177],Top,Object[#2],Object[#223],Object[#223],Top,Top,Object[#177]},{})
> >append_frame(@173,Object[#357])
> >
> > full_frame(@262,{Object[#2],Object[#231],Object[#177],Top,Object[#2]},{})
> >same_frame(@307)
> >same_frame(@317)
> >   at akka.dispatch.Mailboxes.(Mailboxes.scala:33)
> >at akka.actor.ActorSystemImpl.(ActorSystem.scala:635)
> >at akka.actor.ActorSystem$.apply(ActorSystem.scala:143)
> >at akka.actor.ActorSystem$.apply(ActorSystem.scala:120)
> >at
> >
> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
> >at
> > org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
> >at
> > org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
> >at
> >
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1920)
> >at
> 

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
Please specify the following in your maven commands:
-Dscala-2.11

Cheers
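
For what it's worth, the Scala 2.11 build steps documented for Spark 1.5 look
roughly like this, run from the Spark source root (the profiles here are only an
example):

  ./dev/change-scala-version.sh 2.11
  mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

The change-scala-version.sh script rewrites the _2.10 artifact suffixes in the
POMs, which is probably why akka-remote_2.10 was still being resolved without it.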


RE: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Boavida, Rodrigo
Thanks, that worked! I'll let you know the results.

Tnks,
Rod

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: 01 December 2015 15:36
To: Boavida, Rodrigo 
Cc: user@spark.apache.org
Subject: Re: Scala 2.11 and Akka 2.4.0

Please specify the following in your maven commands:
-Dscala-2.11

Cheers


Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-01 Thread SRK
Hi,

We need to monitor and identify whether the Streaming job has been failing for
the last 5 minutes and restart the job accordingly. In most cases our Spark
Streaming job with Kafka direct fails with leader-lost errors, or offsets-not-found
errors for a partition. What is the most effective way to monitor
and identify that the Streaming job has been failing with an error? The
default monitoring provided by Spark does not seem to cover checking whether
the job has been failing for a specific amount of time, or am I missing
something and this feature is already available?

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Effective-ways-monitor-and-identify-that-a-Streaming-job-has-been-failing-for-the-last-5-minutes-tp25536.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
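
One home-grown approach (a sketch only): register a StreamingListener that
remembers when the last batch finished, and let something outside the job decide
it has been unhealthy for 5 minutes. Here ssc is the StreamingContext, and the
threshold and the restart action are left as placeholders:

  import java.util.concurrent.atomic.AtomicLong
  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  class HealthListener extends StreamingListener {
    private val lastCompleted = new AtomicLong(System.currentTimeMillis())

    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
      // remember the wall-clock time of the last batch that finished
      lastCompleted.set(System.currentTimeMillis())
    }

    def unhealthyFor(millis: Long): Boolean =
      System.currentTimeMillis() - lastCompleted.get() > millis
  }

  val listener = new HealthListener
  ssc.addStreamingListener(listener)
  // elsewhere, on a schedule: if (listener.unhealthyFor(5 * 60 * 1000)) { /* alert or restart */ }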



Re: Spark and simulated annealing

2015-12-01 Thread marfago
HI,

Thank you for your suggestion.

Is the scipy library (in particular the scipy.optimize.anneal function) still
able to leverage the parallelism and distributed computation offered by Spark?

Marco



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-simulated-annealing-tp25507p25537.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Alan Braithwaite
Neat, thanks.  If I specify something like -1 as the offset, will it
consume from the latest offset or do I have to instrument that manually?

- Alan

On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger  wrote:

> Yes, there is a version of createDirectStream that lets you specify
> fromOffsets: Map[TopicAndPartition, Long]
>
> On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite 
> wrote:
>
>> Is there any mechanism in the kafka streaming source to specify the exact
>> partition id that we want a streaming job to consume from?
>>
>> If not, is there a workaround besides writing our a custom receiver?
>>
>> Thanks,
>> - Alan
>>
>
>
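
A sketch of that overload against the Spark 1.x Kafka direct API; ssc is the
StreamingContext, and the broker, topic, partitions and starting offsets are
placeholders you would look up from your own offset store:

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

  // explicit starting offsets per (topic, partition)
  val fromOffsets = Map(
    TopicAndPartition("my_topic", 0) -> 1234L,
    TopicAndPartition("my_topic", 1) -> 5678L
  )

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
  )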


Re: Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Ted Yu
What are the versions for Spark / HBase / Phoenix you're using ?

Cheers

On Tue, Dec 1, 2015 at 4:15 AM, Akhilesh Pathodia <
pathodia.akhil...@gmail.com> wrote:

> Hi,
>
> I am running spark job on yarn in cluster mode in secured cluster. Spark
> executors are unable to get hbase connection using phoenix. I am running
> knit command to get the ticket before starting the job and also keytab file
> and principal are correctly specified in connection URL. But still spark
> job on each node throws below error:
>
> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
> at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
> at
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:46470)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(ConnectionManager.java:1606)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1544)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1566)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1595)
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1801)
> at
> org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3678)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:451)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:473)
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:804)
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1194)
> at
> org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:111)
> at
> org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1683)
> at
> org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:592)
> at
> org.apache.phoenix.compile.CreateTableCompiler$2.execute(CreateTableCompiler.java:177)
> at
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:280)
> at
> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:272)
> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
> at
> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:270)
> at
> org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1052)
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1841)
> at
> 

Re: Spark streaming job hangs

2015-12-01 Thread Archit Thakur
Which version of Spark are you running? Have you created a Kafka direct stream?
I am asking because you might or might not be using receivers.
Also, when you say it hangs, do you mean there is no other log after this and the
process is still up?
Or do you mean it kept on adding the jobs but did nothing else? (I am
optimistic :) )

On Tue, Dec 1, 2015 at 4:12 PM, Paul Leclercq 
wrote:

> You might not have enough cores to process data from Kafka
>
>
>> When running a Spark Streaming program locally, do not use “local” or
>> “local[1]” as the master URL. Either of these means that only one thread
>> will be used for running tasks locally. If you are using a input DStream
>> based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
>> thread will be used to run the receiver, leaving no thread for processing
>> the received data. *Hence, when running locally, always use “local[n]”
>> as the master URL, ​*where n > number of receivers to run (see Spark
>> Properties for information on how to set the master).*
>
>
>
>  
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
> 
>
> 2015-12-01 7:13 GMT+01:00 Cassa L :
>
>> Hi,
>>  I am reading data from Kafka into spark. It runs fine for sometime but
>> then hangs forever with following output. I don't see and errors in logs.
>> How do I debug this?
>>
>> 2015-12-01 06:04:30,697 [dag-scheduler-event-loop] INFO
>> (Logging.scala:59) - Adding task set 19.0 with 4 tasks
>> 2015-12-01 06:04:30,872 [pool-13-thread-1] INFO  (Logging.scala:59) -
>> Disconnected from Cassandra cluster: APG DEV Cluster
>> 2015-12-01 06:04:35,060 [JobGenerator] INFO  (Logging.scala:59) - Added
>> jobs for time 1448949875000 ms
>> 2015-12-01 06:04:40,054 [JobGenerator] INFO  (Logging.scala:59) - Added
>> jobs for time 144894988 ms
>> 2015-12-01 06:04:45,034 [JobGenerator] INFO  (Logging.scala:59) - Added
>> jobs for time 1448949885000 ms
>> 2015-12-01 06:04:50,100 [JobGenerator] INFO  (Logging.scala:59) - Added
>> jobs for time 144894989 ms
>> 2015-12-01 06:04:55,064 [JobGenerator] INFO  (Logging.scala:59) - Added
>> jobs for time 1448949895000 ms
>> 2015-12-01 06:05:00,125 [JobGenerator] INFO  (Logging.scala:59) - Added
>> jobs for time 144894990 ms
>>
>>
>> Thanks
>> LCassa
>>
>
>
>
> --
>
> Paul Leclercq | Data engineer
>
>
>  paul.lecle...@tabmo.io  |  http://www.tabmo.fr/
>


Re: spark rdd grouping

2015-12-01 Thread Jacek Laskowski
Hi Rajat,

My quick test has showed that groupBy will preserve the partitions:

scala> sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex
{ case (idx, iter) => val s = iter.toSeq; println(idx + " with " +
s.size + " elements: " + s); s.toIterator
}.groupBy(_._1).mapPartitionsWithIndex { case (idx, iter) => val s =
iter.toSeq; println(idx + " with " + s.size + " elements: " + s);
s.toIterator }.collect

1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))

0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1
1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1

Do I miss anything?

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar  wrote:
> Hi
>
> i have a javaPairRdd rdd1. i want to group by rdd1 by keys but preserve
> the partitions of original rdd only to avoid shuffle since I know all same
> keys are already in same partition.
>
> PairRdd is basically constrcuted using kafka streaming low level consumer
> which have all records with same key already in same partition. Can i group
> them together with avoid shuffle.
>
> Thanks
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
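
If the guarantee really is that all records for a key sit in one partition,
another option is to group inside each partition with mapPartitions and never
touch the shuffle machinery at all; a spark-shell sketch (it only gives a correct
grouping under that assumption, since a key split across partitions would come
out more than once):

  val rdd = sc.parallelize(Seq(0 -> "a", 0 -> "b", 1 -> "c", 1 -> "d"), 2)

  // group inside each partition only; no shuffle is triggered, this is a plain mapPartitions
  val grouped = rdd.mapPartitions({ iter =>
    iter.toSeq.groupBy(_._1).iterator.map { case (k, kvs) => (k, kvs.map(_._2)) }
  }, preservesPartitioning = true)

  grouped.collect().foreach(println)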



Re: spark rdd grouping

2015-12-01 Thread ayan guha
I believe reduceByKeyLocally was introduced for this purpose.

On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski  wrote:

> Hi Rajat,
>
> My quick test has showed that groupBy will preserve the partitions:
>
> scala>
> sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex
> { case (idx, iter) => val s = iter.toSeq; println(idx + " with " +
> s.size + " elements: " + s); s.toIterator
> }.groupBy(_._1).mapPartitionsWithIndex { case (idx, iter) => val s =
> iter.toSeq; println(idx + " with " + s.size + " elements: " + s);
> s.toIterator }.collect
>
> 1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
> 0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))
>
> 0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1
> 1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1
>
> Do I miss anything?
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar 
> wrote:
> > Hi
> >
> > i have a javaPairRdd rdd1. i want to group by rdd1 by keys but
> preserve
> > the partitions of original rdd only to avoid shuffle since I know all
> same
> > keys are already in same partition.
> >
> > PairRdd is basically constrcuted using kafka streaming low level consumer
> > which have all records with same key already in same partition. Can i
> group
> > them together with avoid shuffle.
> >
> > Thanks
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Akhilesh Pathodia
Hi,

I am running spark job on yarn in cluster mode in secured cluster. Spark
executors are unable to get hbase connection using phoenix. I am running
knit command to get the ticket before starting the job and also keytab file
and principal are correctly specified in connection URL. But still spark
job on each node throws below error:

15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication failed.
The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]
at
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
at
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
at
org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:46470)
at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(ConnectionManager.java:1606)
at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1544)
at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1566)
at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1595)
at
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1801)
at
org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
at
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
at
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3678)
at
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:451)
at
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:473)
at
org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:804)
at
org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1194)
at
org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:111)
at
org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1683)
at
org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:592)
at
org.apache.phoenix.compile.CreateTableCompiler$2.execute(CreateTableCompiler.java:177)
at
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:280)
at
org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:272)
at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
at
org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:270)
at
org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1052)
at
org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1841)
at
org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1810)
at
org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
at
org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1810)
at

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
Yes, there is a version of createDirectStream that lets you specify
fromOffsets: Map[TopicAndPartition, Long]

On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite 
wrote:

> Is there any mechanism in the kafka streaming source to specify the exact
> partition id that we want a streaming job to consume from?
>
> If not, is there a workaround besides writing our a custom receiver?
>
> Thanks,
> - Alan
>


Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread Cody Koeninger
If you're consistently getting offset out of range exceptions, it's
probably because messages are getting deleted before you've processed them.

The only real way to deal with this is give kafka more retention, consume
faster, or both.

If you're just looking for a quick "fix" for an infrequent issue, option 4
is probably easiest.  I wouldn't do that automatically / silently, because
you're losing data.

On Mon, Nov 30, 2015 at 6:22 PM, SRK  wrote:

> Hi,
>
> So, our Streaming Job fails with the following errors. If you see the
> errors
> below, they are all related to Kafka losing offsets and
> OffsetOutOfRangeException.
>
> What are the options we have other than fixing Kafka? We would like to do
> something like the following. How can we achieve 1 and 2 with Spark Kafka
> Direct?
>
> 1.Need to see a way to skip some offsets if they are not available after
> the
> max retries are reached..in that case there might be data loss.
>
> 2.Catch that exception and somehow force things to "reset" for that
> partition And how would it handle the offsets already calculated in the
> backlog (if there is one)?
>
> 3.Track the offsets separately, restart the job by providing the offsets.
>
> 4.Or a straightforward approach would be to monitor the log for this error,
> and if it occurs more than X times, kill the job, remove the checkpoint
> directory, and restart.
>
> ERROR DirectKafkaInputDStream: ArrayBuffer(kafka.common.UnknownException,
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([test_stream,5]))
>
>
>
> java.lang.ClassNotFoundException:
> kafka.common.NotLeaderForPartitionException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
>
> java.util.concurrent.RejectedExecutionException: Task
> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@a48c5a8
> rejected from java.util.concurrent.ThreadPoolExecutor@543258e0[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
> 12112]
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 10
> in stage 52.0 failed 4 times, most recent failure: Lost task 10.3 in stage
> 52.0 (TID 255, 172.16.97.97): UnknownReason
>
> Exception in thread "streaming-job-executor-0" java.lang.Error:
> java.lang.InterruptedException
>
> Caused by: java.lang.InterruptedException
>
> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 7
> in
> stage 33.0 failed 4 times, most recent failure: Lost task 7.3 in stage 33.0
> (TID 283, 172.16.97.103): UnknownReason
>
> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
> java.lang.ClassNotFoundException: kafka.common.OffsetOutOfRangeException
>
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Recovery-for-Spark-Streaming-Kafka-Direct-in-case-of-issues-with-Kafka-tp25524.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
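
On option 3, the direct stream's RDDs expose the offset ranges that were actually
read, so they can be stored externally and passed back as fromOffsets on restart;
a sketch, assuming the Spark 1.x Kafka direct API, where directStream is the
DStream returned by createDirectStream and the durable storage of the offsets is
left to you:

  import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

  directStream.foreachRDD { rdd =>
    // the underlying RDD of a direct stream carries its Kafka offset ranges
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    ranges.foreach { r =>
      // persist (topic, partition, untilOffset) somewhere durable around your processing
      println(s"${r.topic} ${r.partition} ${r.fromOffset} -> ${r.untilOffset}")
    }
  }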


RE: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Boavida, Rodrigo
Hi Ted,

Thanks for getting back to me and for the suggestion.

Running a 'mvn dependency:tree' I get the following:

[ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve 
dependencies for project org.apache.spark:spark-core_2.11:jar:1.5.2: The 
following artifacts could not be resolved: 
com.typesafe.akka:akka-remote_2.10:jar:2.4.0, 
com.typesafe.akka:akka-slf4j_2.10:jar:2.4.0, 
com.typesafe.akka:akka-testkit_2.10:jar:2.4.0: Could not find artifact 
com.typesafe.akka:akka-remote_2.10:jar:2.4.0 in central 
(https://repo1.maven.org/maven2) -> [Help 1]

So it seems somehow it's still pulling some 2.10 dependencies. Do you think 
this could be the cause for the observed problem?

tnks,
Rod

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: 01 December 2015 14:13
To: Boavida, Rodrigo 
Cc: user@spark.apache.org
Subject: Re: Scala 2.11 and Akka 2.4.0

Have you run 'mvn dependency:tree' and examined the output ?

There should be some hint.

Cheers

> On Dec 1, 2015, at 5:32 AM, RodrigoB  wrote:
>
> Hi,
>
> I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.
> I've changed the main pom.xml files to corresponding akka version and
> am getting the following exception when starting the master on standalone:
>
> Exception Details:
>  Location:
>akka/dispatch/Mailbox.processAllSystemMessages()V @152: getstatic
>  Reason:
>Type top (current frame, locals[9]) is not assignable to
> 'akka/dispatch/sysmsg/SystemMessage' (stack map, locals[9])  Current
> Frame:
>bci: @131
>flags: { }
>locals: { 'akka/dispatch/Mailbox',
> 'java/lang/InterruptedException',
> 'akka/dispatch/sysmsg/SystemMessage', top, 'akka/dispatch/Mailbox', 
> 'java/lang/Throwable', 'java/lang/Throwable' }
>stack: { integer }
>  Stackmap Frame:
>bci: @152
>flags: { }
>locals: { 'akka/dispatch/Mailbox',
> 'java/lang/InterruptedException',
> 'akka/dispatch/sysmsg/SystemMessage', top, 'akka/dispatch/Mailbox',
> 'java/lang/Throwable', 'java/lang/Throwable', top, top, 
> 'akka/dispatch/sysmsg/SystemMessage' }
>stack: { }
>  Bytecode:
>0x000: 014c 2ab2 0132 b601 35b6 0139 4db2 013e
>0x010: 2cb6 0142 9900 522a b600 c69a 004b 2c4e
>0x020: b201 3e2c b601 454d 2db9 0148 0100 2ab6
>0x030: 0052 2db6 014b b801 0999 000e bb00 e759
>0x040: 1301 4db7 010f 4cb2 013e 2cb6 0150 99ff
>0x050: bf2a b600 c69a ffb8 2ab2 0132 b601 35b6
>0x060: 0139 4da7 ffaa 2ab6 0052 b600 56b6 0154
>0x070: b601 5a3a 04a7 0091 3a05 1905 3a06 1906
>0x080: c100 e799 0015 1906 c000 e73a 0719 074c
>0x090: b200 f63a 08a7 0071 b201 5f19 06b6 0163
>0x0a0: 3a0a 190a b601 6899 0006 1905 bf19 0ab6
>0x0b0: 016c c000 df3a 0b2a b600 52b6 0170 b601
>0x0c0: 76bb 000f 5919 0b2a b600 52b6 017a b601
>0x0d0: 80b6 0186 2ab6 018a bb01 8c59 b701 8e13
>0x0e0: 0190 b601 9419 09b6 0194 1301 96b6 0194
>0x0f0: 190b b601 99b6 0194 b601 9ab7 019d b601
>0x100: a3b2 00f6 3a08 b201 3e2c b601 4299 0026
>0x110: 2c3a 09b2 013e 2cb6 0145 4d19 09b9 0148
>0x120: 0100 1904 2ab6 0052 b601 7a19 09b6 01a7
>0x130: a7ff d62b c600 09b8 0109 572b bfb1  Exception Handler
> Table:
>bci [290, 307] => handler: 120
>  Stackmap Table:
>append_frame(@13,Object[#231],Object[#177])
>append_frame(@71,Object[#177])
>chop_frame(@102,1)
>
> full_frame(@120,{Object[#2],Object[#231],Object[#177],Top,Object[#2],O
> bject[#177]},{Object[#223]})
>
> full_frame(@152,{Object[#2],Object[#231],Object[#177],Top,Object[#2],Object[#223],Object[#223],Top,Top,Object[#177]},{})
>append_frame(@173,Object[#357])
>
> full_frame(@262,{Object[#2],Object[#231],Object[#177],Top,Object[#2]},{})
>same_frame(@307)
>same_frame(@317)
>   at akka.dispatch.Mailboxes.(Mailboxes.scala:33)
>at akka.actor.ActorSystemImpl.(ActorSystem.scala:635)
>at akka.actor.ActorSystem$.apply(ActorSystem.scala:143)
>at akka.actor.ActorSystem$.apply(ActorSystem.scala:120)
>at
> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
>at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
>at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
>at
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1920)
>at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
>at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1911)
>at
> org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
>at
> org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:253)
>at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53)
>at
> org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1074)
> 

Re: Scala 2.11 and Akka 2.4.0

2015-12-01 Thread Ted Yu
Have you run 'mvn dependency:tree' and examined the output ?

There should be some hint. 

Cheers

> On Dec 1, 2015, at 5:32 AM, RodrigoB  wrote:
> 
> Hi,
> 
> I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.
> I've changed the main pom.xml files to corresponding akka version and am
> getting the following exception when starting the master on standalone:
> 
> Exception Details:
>  Location:
>akka/dispatch/Mailbox.processAllSystemMessages()V @152: getstatic
>  Reason:
>Type top (current frame, locals[9]) is not assignable to
> 'akka/dispatch/sysmsg/SystemMessage' (stack map, locals[9])
>  Current Frame:
>bci: @131
>flags: { }
>locals: { 'akka/dispatch/Mailbox', 'java/lang/InterruptedException',
> 'akka/dispatch/sysmsg/SystemMessage', top, 'akka/dispatch/Mailbox',
> 'java/lang/Throwable', 'java/lang/Throwable' }
>stack: { integer }
>  Stackmap Frame:
>bci: @152
>flags: { }
>locals: { 'akka/dispatch/Mailbox', 'java/lang/InterruptedException',
> 'akka/dispatch/sysmsg/SystemMessage', top, 'akka/dispatch/Mailbox',
> 'java/lang/Throwable', 'java/lang/Throwable', top, top,
> 'akka/dispatch/sysmsg/SystemMessage' }
>stack: { }
>  Bytecode:
>0x000: 014c 2ab2 0132 b601 35b6 0139 4db2 013e
>0x010: 2cb6 0142 9900 522a b600 c69a 004b 2c4e
>0x020: b201 3e2c b601 454d 2db9 0148 0100 2ab6
>0x030: 0052 2db6 014b b801 0999 000e bb00 e759
>0x040: 1301 4db7 010f 4cb2 013e 2cb6 0150 99ff
>0x050: bf2a b600 c69a ffb8 2ab2 0132 b601 35b6
>0x060: 0139 4da7 ffaa 2ab6 0052 b600 56b6 0154
>0x070: b601 5a3a 04a7 0091 3a05 1905 3a06 1906
>0x080: c100 e799 0015 1906 c000 e73a 0719 074c
>0x090: b200 f63a 08a7 0071 b201 5f19 06b6 0163
>0x0a0: 3a0a 190a b601 6899 0006 1905 bf19 0ab6
>0x0b0: 016c c000 df3a 0b2a b600 52b6 0170 b601
>0x0c0: 76bb 000f 5919 0b2a b600 52b6 017a b601
>0x0d0: 80b6 0186 2ab6 018a bb01 8c59 b701 8e13
>0x0e0: 0190 b601 9419 09b6 0194 1301 96b6 0194
>0x0f0: 190b b601 99b6 0194 b601 9ab7 019d b601
>0x100: a3b2 00f6 3a08 b201 3e2c b601 4299 0026
>0x110: 2c3a 09b2 013e 2cb6 0145 4d19 09b9 0148
>0x120: 0100 1904 2ab6 0052 b601 7a19 09b6 01a7
>0x130: a7ff d62b c600 09b8 0109 572b bfb1
>  Exception Handler Table:
>bci [290, 307] => handler: 120
>  Stackmap Table:
>append_frame(@13,Object[#231],Object[#177])
>append_frame(@71,Object[#177])
>chop_frame(@102,1)
> 
> full_frame(@120,{Object[#2],Object[#231],Object[#177],Top,Object[#2],Object[#177]},{Object[#223]})
> 
> full_frame(@152,{Object[#2],Object[#231],Object[#177],Top,Object[#2],Object[#223],Object[#223],Top,Top,Object[#177]},{})
>append_frame(@173,Object[#357])
> 
> full_frame(@262,{Object[#2],Object[#231],Object[#177],Top,Object[#2]},{})
>same_frame(@307)
>same_frame(@317)
>   at akka.dispatch.Mailboxes.(Mailboxes.scala:33)
>at akka.actor.ActorSystemImpl.(ActorSystem.scala:635)
>at akka.actor.ActorSystem$.apply(ActorSystem.scala:143)
>at akka.actor.ActorSystem$.apply(ActorSystem.scala:120)
>at
> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
>at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
>at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
>at
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1920)
>at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
>at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1911)
>at
> org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
>at
> org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:253)
>at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53)
>at
> org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1074)
>at org.apache.spark.deploy.master.Master$.main(Master.scala:1058)
>at org.apache.spark.deploy.master.Master.main(Master.scala)
> 
> ---
> 
> Has anyone encountered this problem before? Seems to be related with a
> version mismatch at some level with the Akka mailbox. I would very much
> appreciate any comments.
> 
> tnks,
> Rod
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Scala-2-11-and-Akka-2-4-0-tp25535.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: 

Send JsonDocument to Couchbase

2015-12-01 Thread Eyal Sharon
Hi ,

I am still having problems with the Couchbase connector. Consider the
following code fragment, which aims to create a JSON document to send to
Couchbase:






val muCache = {
  val values = JsonArray.from(mu.toArray.map(_.toString))
  val content = JsonObject.create().put("feature_mean", values).put("features", mu.size)
  JsonDocument.create("mu", content)
}

(mu is a vector of doubles)


This fails because JsonArray is a Java API which doesn't accept
Scala strings. I got this error:

Unsupported type for JsonArray: class [Ljava.lang.String;


1 - What could be a solution for that?

2 - For cleaner usage, I thought of using one of the Scala JSON
libraries, but none of them supports JsonDocuments.

I followed the implementation from Couchbase's Spark connector
documentation -
http://developer.couchbase.com/documentation/server/4.0/connectors/spark-1.0/spark-intro.html


Thanks !
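
One thing that may be going on here: JsonArray.from has Java varargs and
java.util.List overloads, so a Scala Array[String] passed as a single argument
gets treated as one (unsupported) element. Two possible workarounds, sketched but
not tested against the connector, where mu is the vector from the snippet above:

  import scala.collection.JavaConverters._
  import com.couchbase.client.java.document.json.JsonArray

  // expand the Scala array into the Java varargs overload
  val values1 = JsonArray.from(mu.toArray.map(_.toString): _*)

  // or hand over a java.util.List
  val values2 = JsonArray.from(mu.toArray.map(_.toString).toList.asJava)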



Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
I actually haven't tried that, since I tend to do the offset lookups if
necessary.

It's possible that it will work, try it and let me know.

Be aware that if you're doing a count() or take() operation directly on the
rdd it'll definitely give you the wrong result if you're using -1 for one
of the offsets.



On Tue, Dec 1, 2015 at 9:58 AM, Alan Braithwaite 
wrote:

> Neat, thanks.  If I specify something like -1 as the offset, will it
> consume from the latest offset or do I have to instrument that manually?
>
> - Alan
>
> On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger  wrote:
>
>> Yes, there is a version of createDirectStream that lets you specify
>> fromOffsets: Map[TopicAndPartition, Long]
>>
>> On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite 
>> wrote:
>>
>>> Is there any mechanism in the kafka streaming source to specify the
>>> exact partition id that we want a streaming job to consume from?
>>>
>>> If not, is there a workaround besides writing our a custom receiver?
>>>
>>> Thanks,
>>> - Alan
>>>
>>
>>
>


Re: spark-ec2 vs. EMR

2015-12-01 Thread Nick Chammas
Pinging this thread in case anyone has thoughts on the matter they want to
share.

On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Spark has come bundled with spark-ec2
>  for many years. At
> the same time, EMR has been capable of running Spark for a while, and
> earlier this year it added "official" support
> .
>
> If you're looking for a way to provision Spark clusters, there are some
> clear differences between these 2 options. I think the biggest one would be
> that EMR is a "production" solution backed by a company, whereas spark-ec2
> is not really intended for production use (as far as I know).
>
> That particular difference in intended use may or may not matter to you,
> but I'm curious:
>
> What are some of the other differences between the 2 that do matter to
> you? If you were considering these 2 solutions for your use case at one
> point recently, why did you choose one over the other?
>
> I'd be especially interested in hearing about why people might choose
> spark-ec2 over EMR, since the latter option seems to have shaped up nicely
> this year.
>
> Nick
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-spark-ec2-vs-EMR-tp25538.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Migrate a cassandra table among from one cluster to another

2015-12-01 Thread George Sigletos
Hello,

Does anybody know how to copy a cassandra table (or an entire keyspace)
from one cluster to another using Spark? I haven't found anything very
specific about this so far.

Thank you,
George
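
One approach, based on the multiple-clusters pattern in the DataStax
spark-cassandra-connector docs, is to override the connector per block of code:
read with a connection pointed at the source cluster and write with one pointed
at the target. A rough sketch with placeholder hosts, keyspace and table names:

  import com.datastax.spark.connector._
  import com.datastax.spark.connector.cql.CassandraConnector

  // sc.getConf returns a copy, so each connector gets its own host setting
  val sourceConnector = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "source-node"))
  val targetConnector = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "target-node"))

  // read from the source cluster
  val rows = {
    implicit val c = sourceConnector
    sc.cassandraTable("my_keyspace", "my_table")
  }

  // write the same rows to the target cluster
  {
    implicit val c = targetConnector
    rows.saveToCassandra("my_keyspace", "my_table")
  }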


Re: Unable to get phoenix connection in spark job in secured cluster

2015-12-01 Thread Akhilesh Pathodia
Spark - 1.3.1
Hbase - 1.0.0
Phoenix - 4.3
Cloudera - 5.4

On Tue, Dec 1, 2015 at 9:35 PM, Ted Yu  wrote:

> What are the versions for Spark / HBase / Phoenix you're using ?
>
> Cheers
>
> On Tue, Dec 1, 2015 at 4:15 AM, Akhilesh Pathodia <
> pathodia.akhil...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running spark job on yarn in cluster mode in secured cluster. Spark
>> executors are unable to get hbase connection using phoenix. I am running
>> knit command to get the ticket before starting the job and also keytab file
>> and principal are correctly specified in connection URL. But still spark
>> job on each node throws below error:
>>
>> 15/12/01 03:23:15 ERROR ipc.AbstractRpcClient: SASL authentication
>> failed. The most likely cause is missing or invalid credentials. Consider
>> 'kinit'.
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>> GSSException: No valid credentials provided (Mechanism level: Failed to
>> find any Kerberos tgt)]
>> at
>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>> at
>> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
>> at
>> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
>> at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>> at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>> at
>> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:46470)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(ConnectionManager.java:1606)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1544)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1566)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1595)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1801)
>> at
>> org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
>> at
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
>> at
>> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3678)
>> at
>> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:451)
>> at
>> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:473)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:804)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1194)
>> at
>> org.apache.phoenix.query.DelegateConnectionQueryServices.createTable(DelegateConnectionQueryServices.java:111)
>> at
>> org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:1683)
>> at
>> org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:592)
>> at
>> org.apache.phoenix.compile.CreateTableCompiler$2.execute(CreateTableCompiler.java:177)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:280)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:272)
>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:270)
>> at
>>