Re: Where does the Driver run?

2019-03-24 Thread Akhil Das
There's also a driver UI (usually available on port 4040). Since I assume you
are running the code on your own machine, visit localhost:4040 once the job is
running and you will get the driver UI.

If you think the driver is running on your master/executor nodes, login to
those machines and do a

   netstat -napt | grep -i listen

If the driver is there you will see it listening on 404x, but that mostly won't
be the case here since you are not using spark-submit with deployMode=cluster.
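
If the driver really were in the cluster, you would have launched it with
spark-submit in cluster deploy mode; a minimal sketch (class name and jar path
are placeholders):

spark-submit \
  --master spark://master-address:7077 \
  --deploy-mode cluster \
  --class com.example.MyJob \
  /path/to/my-app.jar

Launched that way, the standalone Master picks a Worker to host the driver, and
it shows up under "Running Drivers" in the Master UI on port 8080.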

On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: "in
> deployMode 'cluster' the Driver runs on a Worker in the cluster"
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
> [image: image.png]
>
> Hope this helps: https://spark.apache.org/docs/latest/cluster-overview.html
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on
>> "http://master-address:8080", there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*; I start the Job programmatically, and
>> this is where many explanations fork from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName("my-job")
>>   conf.setMaster("spark://master-address:7077")
>>   conf.set("deployMode", "cluster")
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with "my-app"
>>   val jars = listJars("/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don't see it in the GUI; I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> "Driver" to run on/in the Yarn Master, but I am using the Spark Standalone
>> Master, so where is the Driver part of the Job running?
>>
>> If it is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers, sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need to understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside an Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for your help!
>>
>
>
> --
> Cheers!
>
>


Re: Where does the Driver run?

2019-03-23 Thread Akhil Das
If you are starting your "my-app" on your local machine, that's where the
driver is running.

[image: image.png]

Hope this helps.


On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on
> "http://master-address:8080", there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*; I start the Job programmatically, and
> this is where many explanations fork from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName("my-job")
>   conf.setMaster("spark://master-address:7077")
>   conf.set("deployMode", "cluster")
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with "my-app"
>   val jars = listJars("/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don't see it in the GUI; I see 2 Executors
> taking all cluster resources. With a Yarn cluster I would expect the
> "Driver" to run on/in the Yarn Master, but I am using the Spark Standalone
> Master, so where is the Driver part of the Job running?
>
> If it is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers, sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The “Driver” creates and broadcasts some large data structures so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for your help!
>


-- 
Cheers!


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Akhil Das
spark.sql.shuffle.partitions is still used, I believe. I can see it in the
code and in the documentation page.
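
For local test runs you can simply lower it when building the session; a
minimal sketch (assuming Spark 2.x, and the value 4 is only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-tests")
  .config("spark.sql.shuffle.partitions", "4")  // instead of the default 200
  .getOrCreate()

With that in place, post-shuffle stages in the tests use 4 partitions instead
of 200, which is usually noticeably faster for tiny DataFrames.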

On Wed, Sep 13, 2017 at 4:46 AM, peay  wrote:

> Hello,
>
> I am running unit tests with Spark DataFrames, and I am looking for
> configuration tweaks that would make tests faster. Usually, I use a
> local[2] or local[4] master.
>
> Something that has been bothering me is that most of my stages end up
> using 200 partitions, independently of whether I repartition the input.
> This seems a bit overkill for small unit tests that barely have 200 rows
> per DataFrame.
>
> spark.sql.shuffle.partitions used to control this I believe, but it seems
> to be gone and I could not find any information on what mechanism/setting
> replaces it or the corresponding JIRA.
>
> Has anyone experience to share on how to tune Spark best for very small
> local runs like that?
>
> Thanks!
>
>


-- 
Cheers!


Re: PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-16 Thread Akhil Das
scala> case class Fruit(price: Double, name: String)
defined class Fruit

scala> val ds = Seq(Fruit(10.0,"Apple")).toDS()
ds: org.apache.spark.sql.Dataset[Fruit] = [price: double, name: string]

scala> ds.rdd.flatMap(f => f.name.toList).collect
res8: Array[Char] = Array(A, p, p, l, e)


Is this what you want to do?
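
If you would rather stay in the Dataset API instead of dropping to the RDD,
something along these lines should also work; a sketch assuming a Spark 2.x
SparkSession named spark (the characters are mapped to Strings because a
built-in encoder exists for String):

import spark.implicits._

// the lambda's parameter type is inferred from Dataset[(String, String)],
// so no explicit Encoder argument is needed
val expanded = tplDataSet.flatMap(tpl => tpl._2.toList.map(_.toString))  // Dataset[String]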

On Fri, Sep 15, 2017 at 4:21 AM, Marco Mistroni  wrote:

> HI all
>  could anyone assist pls?
> I am trying to flatMap a DataSet[(String, String)] and I am getting errors
> in Eclipse.
> The errors are more Scala-related than Spark-related, but I was wondering
> if someone came across
> a similar situation.
>
> Here's what I've got: a DS of (String, String), out of which I am using
> flatMap to get a List[Char] for the second element in the tuple.
>
> val tplDataSet = < DataSet[(String, String)] >
>
> val expanded = tplDataSet.flatMap(tpl  => tpl._2.toList,
> Encoders.product[(String, String)])
>
>
> Eclipse complains that 'tpl' in the above function is missing a parameter
> type.
>
> What am I missing? Or perhaps I am using the wrong approach?
>
> w/kindest regards
>  Marco
>



-- 
Cheers!


Re: Size exceeds Integer.MAX_VALUE issue with RandomForest

2017-09-16 Thread Akhil Das
What are the parameters you passed to the classifier and what is the size
of your training data? You are hitting that issue because one of the shuffle
blocks is over 2G; repartitioning the data will help.
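
For example, a minimal Scala sketch of the repartition idea (the DataFrame
name and the partition count are placeholders; SparkR has an equivalent
repartition() function):

// spread the training data over more, smaller partitions so that no single
// shuffle block grows past the 2GB limit
val repartitioned = trainDF.repartition(1000)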

On Fri, Sep 15, 2017 at 7:55 PM, rpulluru  wrote:

> Hi,
>
> I am using sparkR randomForest function and running into
> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE issue.
> Looks like I am running into this issue
> https://issues.apache.org/jira/browse/SPARK-1476, I used
> spark.default.parallelism=1000 but still facing the same issue.
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Akhil Das
I guess no. I came across a test case where they are marked as Unsupported.
However, the one running inside Databricks has support for this:
https://docs.databricks.com/spark/latest/spark-sql/structured-data-access-controls.html

On Fri, Sep 15, 2017 at 10:13 PM, Arun Khetarpal 
wrote:

> Hi -
>
> Wanted to understand if spark sql has GRANT and REVOKE statements
> available?
> Is anyone working on making that available?
>
> Regards,
> Arun
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


Re: spark.streaming.receiver.maxRate

2017-09-16 Thread Akhil Das
I believe that's a question for the NiFi list. As you can see, the code base
is quite old
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver/src/main/java/org/apache/nifi/spark
and it doesn't make use of Spark's rate limiter:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/receiver/RateLimiter.scala


On Sat, Sep 16, 2017 at 1:59 AM, Margus Roo  wrote:

> Some more info
>
> val lines = ssc.socketStream() // works
> val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK_SER_2)) // does not work
>
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
> On 15/09/2017 21:50, Margus Roo wrote:
>
> Hi
>
> I tested spark.streaming.receiver.maxRate and
> spark.streaming.backpressure.enabled settings using socketStream and it
> works.
>
> But if I am using nifi-spark-receiver (https://mvnrepository.com/
> artifact/org.apache.nifi/nifi-spark-receiver) then it does not use
> spark.streaming.receiver.maxRate.
>
> any workaround?
>
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
> On 14/09/2017 09:57, Margus Roo wrote:
>
> Hi
>
> Using Spark 2.1.1.2.6-1.0-129 (from Hortonworks distro) and Scala 2.11.8
> and Java 1.8.0_60
>
> I have a NiFi flow that produces more records than the Spark stream can process
> in a batch interval. To avoid the Spark queue overflowing I wanted to try Spark
> Streaming backpressure (it did not work for me), so I fell back to the simpler
> but static solution and tried spark.streaming.receiver.maxRate.
>
> I set it spark.streaming.receiver.maxRate=1. As I understand it from
> Spark manual: "If the batch processing time is more than batchinterval
> then obviously the receiver’s memory will start filling up and will end up
> in throwing exceptions (most probably BlockNotFoundException). Currently
> there is no way to pause the receiver. Using SparkConf configuration
> spark.streaming.receiver.maxRate, rate of receiver can be limited." - it
> means 1 record per second?
>
> I have very simple code:
>
> val conf = new SiteToSiteClient.Builder()
>   .url("http://192.168.80.120:9090/nifi")
>   .portName("testing")
>   .buildConfig()
> val ssc = new StreamingContext(sc, Seconds(1))
> val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_AND_DISK))
> lines.print()
>
> ssc.start()
>
>
> I have loads of records waiting in Nifi testing port. After I start
> ssc.start() I will get 9-7 records per batch (batch time is 1s). Do I
> understand spark.streaming.receiver.maxRate wrong?
>
> --
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
>
>
>


-- 
Cheers!


Re: How is data desensitization (example: select bank_no from users)?

2017-08-24 Thread Akhil Das
Usually analysts will not have access to data stored in the PCI zone; you
could write the data out to a table for the analysts, masking the sensitive
information.

Eg:


> val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7))
> val df = sc.parallelize(Seq(("user1", "400-000-444"))).toDF("user", 
> "sensitive_info")
> df.show

+-----+--------------+
| user|sensitive_info|
+-----+--------------+
|user1|   400-000-444|
+-----+--------------+

> df.withColumn("sensitive_info", mask_udf($"sensitive_info")).show

+-----+----------------+
| user|  sensitive_info|
+-----+----------------+
|user1|************-444|
+-----+----------------+


On Sat, Aug 19, 2017 at 10:42 PM, 李斌松  wrote:

> For example, the user's bank card number cannot be viewed by an analyst
> and replaced by an asterisk. How do you do that in spark?
>



-- 
Cheers!


Re: UI for spark machine learning.

2017-08-24 Thread Akhil Das
How many iterations are you doing on the data? Like Jörn said, you don't
necessarily need a billion samples for linear regression.
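
On the "setinitialmodel" point quoted below: the RDD-based spark.mllib API does
accept initial weights, unlike ml's LinearRegression, so one possible workaround
is to warm-start each chunk from the previous weights. A rough sketch (chunk1
and chunk2 are placeholder RDD[LabeledPoint] chunks, and the hyperparameters are
placeholders too):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// first chunk
var model = LinearRegressionWithSGD.train(chunk1, numIterations, stepSize, miniBatchFraction)
// next chunk, warm-started from the previous chunk's weights
model = LinearRegressionWithSGD.train(chunk2, numIterations, stepSize, miniBatchFraction, model.weights)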

On Tue, Aug 22, 2017 at 6:28 PM, Sea aj  wrote:

> Jorn,
>
> My question is not about the model type but instead, the spark capability
> on reusing any already trained ml model in training a new model.
>
>
>
>
> On Tue, Aug 22, 2017 at 1:13 PM, Jörn Franke  wrote:
>
>> Is it really required to have one billion samples for just linear
>> regression? Probably your model would do equally well with much less
>> samples. Have you checked bias and variance if you use much less random
>> samples?
>>
>> On 22. Aug 2017, at 12:58, Sea aj  wrote:
>>
>> I have a large dataframe of 1 billion rows of type LabeledPoint. I tried
>> to train a linear regression model on the df but it failed due to lack of
>> memory although I'm using 9 slaves, each with 100gb of ram and 16 cores of
>> CPU.
>>
>> I decided to split my data into multiple chunks and train the model in
>> multiple phases but I learned the linear regression model in ml library
>> does not have "setinitialmodel" function to be able to pass the trained
>> model from one chunk to the rest of the chunks. In other words, each time I
>> call the fit function over a chunk of my data, it overwrites the previous
>> model.
>>
>> So far the only solution I found is using Spark Streaming to be able to
>> split the data to multiple dfs and then train over each individually to
>> overcome memory issue.
>>
>> Do you know if there's any other solution?
>>
>>
>>
>>
>> On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar 
>> wrote:
>>
>>> Hello Mahesh,
>>>
>>> We have built one. You can download from here :
>>> https://www.sparkflows.io/download
>>>
>>> Feel free to ping me for any questions, etc.
>>>
>>> Best Regards,
>>> Jayant
>>>
>>>
>>> On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
>>> mahesh_sawai...@persistent.com> wrote:
>>>
 Hi,


 1) Is anyone aware of any workbench kind of tool to run ML jobs in
 spark. Specifically is the tool  could be something like a Web application
 that is configured to connect to a spark cluster.


 User is able to select input training sets probably from hdfs , train
 and then run predictions, without having to write any Scala code.


 2) If there is not tool, is there value in having such tool, what could
 be the challenges.


 Thanks,

 Mahesh


 DISCLAIMER
 ==
 This e-mail may contain privileged and confidential information which
 is the property of Persistent Systems Ltd. It is intended only for the use
 of the individual or entity to which it is addressed. If you are not the
 intended recipient, you are not authorized to read, retain, copy, print,
 distribute or use this message. If you have received this communication in
 error, please notify the sender and delete all copies of this message.
 Persistent Systems Ltd. does not accept any liability for virus infected
 mails.

>>>
>>>
>>
>


-- 
Cheers!


Re: ORC Transaction Table - Spark

2017-08-24 Thread Akhil Das
How are you reading the data? It's clearly saying:
java.lang.NumberFormatException: For input string: "0645253_0001"

On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal 
wrote:

> Hi,
>
> I am trying to read hive orc transaction table through Spark but I am
> getting the following error
>
> Caused by: java.lang.RuntimeException: serious problem
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
> tsInfo(OrcInputFormat.java:1021)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(Or
> cInputFormat.java:1048)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> .
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.NumberFormatException: For input string: "0645253_0001"
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
> tsInfo(OrcInputFormat.java:998)
> ... 118 more
>
> Any help would be appreciated.
>
> Thanks and Regards,
> Aviral Agarwal
>
>


-- 
Cheers!


Re: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-08-24 Thread Akhil Das
Have you tried setting spark.executor.instances to a positive non-zero value
instead of 0? Also, since it's a streaming application, set executor cores > 1.
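
Something along these lines, mirroring your command below with the values
changed (the numbers are only illustrative):

spark-submit run-example \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=2 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.master=yarn \
  --conf spark.submit.deployMode=client \
  org.apache.spark.examples.streaming.HdfsWordCount /foo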

On Wed, Aug 23, 2017 at 3:38 AM, Karthik Palaniappan  wrote:

> I ran the HdfsWordCount example using this command:
>
> spark-submit run-example \
>   --conf spark.streaming.dynamicAllocation.enabled=true \
>   --conf spark.executor.instances=0 \
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.master=yarn \
>   --conf spark.submit.deployMode=client \
>   org.apache.spark.examples.streaming.HdfsWordCount /foo
>
> I tried it on both Spark 2.1.1 (through HDP 2.6) and Spark 2.2.0 (through
> Google Dataproc 1.2), and I get the same message repeatedly that Spark
> cannot allocate any executors.
>
> 17/08/22 19:34:57 INFO org.spark_project.jetty.util.log: Logging
> initialized @1694ms
> 17/08/22 19:34:57 INFO org.spark_project.jetty.server.Server:
> jetty-9.3.z-SNAPSHOT
> 17/08/22 19:34:57 INFO org.spark_project.jetty.server.Server: Started
> @1756ms
> 17/08/22 19:34:57 INFO org.spark_project.jetty.server.AbstractConnector:
> Started ServerConnector@578782d6{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
> 17/08/22 19:34:58 INFO 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase:
> GHFS version: 1.6.1-hadoop2
> 17/08/22 19:34:58 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting
> to ResourceManager at hadoop-m/10.240.1.92:8032
> 17/08/22 19:35:00 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl:
> Submitted application application_1503036971561_0022
> 17/08/22 19:35:04 WARN org.apache.spark.streaming.StreamingContext:
> Dynamic Allocation is enabled for this application. Enabling Dynamic
> allocation for Spark Streaming applications can cause data loss if Write
> Ahead Log is not enabled for non-replayable sources like Flume. See the
> programming guide for details on how to enable the Write Ahead Log.
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient resources
>
> I confirmed that the YARN cluster has enough memory for dozens of
> executors, and verified that the application allocates executors when using
> Core's spark.dynamicAllocation.enabled=true, and leaving spark.streaming.
> dynamicAllocation.enabled=false.
>
> Is streaming dynamic allocation actually supported? Sean Owen suggested it
> might have been experimental: https://issues.apache.org/jira/browse/SPARK-
> 21792.
>



-- 
Cheers!


Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
Can you try the Cassandra connector 1.5? It is also compatible with Spark
1.6 according to their documentation:
https://github.com/datastax/spark-cassandra-connector#version-compatibility
You can also crosspost it over here:
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user

On Fri, Jul 1, 2016 at 5:45 PM, Joaquin Alzola <joaquin.alz...@lebara.com>
wrote:

> HI Akhil
>
>
>
> I am using:
>
> Cassandra: 3.0.5
>
> Spark: 1.6.1
>
> Scala 2.10
>
> Spark-cassandra connector: 1.6.0
>
>
>
> *From:* Akhil Das [mailto:ak...@hacked.work]
> *Sent:* 01 July 2016 11:38
> *To:* Joaquin Alzola <joaquin.alz...@lebara.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Remote RPC client disassociated
>
>
>
> This looks like a version conflict, which version of spark are you using?
> The Cassandra connector you are using is for Scala 2.10x and Spark 1.6
> version.
>
>
>
> On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola <joaquin.alz...@lebara.com>
> wrote:
>
> HI List,
>
>
>
> I am launching this spark-submit job:
>
>
>
> hadoop@testbedocg:/mnt/spark> bin/spark-submit --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 --jars
> /mnt/spark/lib/TargetHolding_pyspark-cassandra-0.3.5.jar spark_v2.py
>
>
>
> spark_v2.py is:
>
> from pyspark_cassandra import CassandraSparkContext, Row
>
> from pyspark import SparkContext, SparkConf
>
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("test").setMaster("spark://
> 192.168.23.31:7077").set("spark.cassandra.connection.host",
> "192.168.23.31")
>
> sc = CassandraSparkContext(conf=conf)
>
> table =
> sc.cassandraTable("lebara_diameter_codes","nl_lebara_diameter_codes")
>
> food_count = table.select("errorcode2001").groupBy("errorcode2001").count()
>
> food_count.collect()
>
>
>
>
>
> Error I get when running the above command:
>
>
>
> [Stage 0:>  (0 +
> 3) / 7]16/06/30 10:40:36 ERROR TaskSchedulerImpl: Lost executor 0 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 7) / 7]16/06/30 10:40:40 ERROR TaskSchedulerImpl: Lost executor 1 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 5) / 7]16/06/30 10:40:42 ERROR TaskSchedulerImpl: Lost executor 3 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 4) / 7]16/06/30 10:40:46 ERROR TaskSchedulerImpl: Lost executor 4 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> 16/06/30 10:40:46 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4
> times; aborting job
>
> Traceback (most recent call last):
>
>   File "/mnt/spark-1.6.1-bin-hadoop2.6/spark_v2.py", line 11, in 
>
> food_count =
> table.select("errorcode2001").groupBy("errorcode2001").count()
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in
> count
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in
> fold
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in
> collect
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 813, in __call__
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
> 308, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 0.0 (TID 14, as4): ExecutorLostFailure (executor 4 exited caused by one of
> the running tasks) Reason: Remote RPC client disassociated. Likely due to
> containers exceeding thresholds, or network issues. Check driver log

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-01 Thread Akhil Das
You can use https://github.com/wurstmeister/kafka-docker to spin up a
Kafka cluster and then point your Spark Streaming job at it to consume from it.
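
Once the containers are up, the streaming job consumes from it as usual; a
minimal sketch (Spark 1.x spark-streaming-kafka API; the zookeeper address,
group id and topic name are assumptions about your docker setup):

import org.apache.spark.streaming.kafka.KafkaUtils

// one receiver thread for the topic; adjust to match the dockerised setup
val stream = KafkaUtils.createStream(ssc, "localhost:2181", "test-group", Map("test-topic" -> 1))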

On Fri, Jul 1, 2016 at 1:19 AM, SRK  wrote:

> Hi,
>
> I need to do integration tests using Spark Streaming. My idea is to spin up
> kafka using docker locally and use it to feed the stream to my Streaming
> Job. Any suggestions on how to do this would be of great help.
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Akhil Das
Something like this?

import sqlContext.implicits._
case class Holder(str: String, js:JsValue)

yourRDD.map(x => Holder(x._1, x._2)).toDF()
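
If toDF() complains about the JsValue column (Spark has no built-in encoder for
play-json types), a common workaround, sketched below, is to carry the JSON as a
plain String instead (Spark 1.x API; the names follow the snippet above):

case class Holder(str: String, js: String)

val df = yourRDD.map { case (k, v) => Holder(k, v.toString) }.toDF()  // JsValue.toString gives compact JSON
df.registerTempTable("holders")  // now SQL can be run against it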



On Fri, Jul 1, 2016 at 3:36 AM, Dood@ODDO  wrote:

> Hello,
>
> I have an RDD[(String,JsValue)] that I want to convert into a DataFrame
> and then run SQL on. What is the easiest way to get the JSON (in form of
> JsValue) "understood" by the process?
>
> Thanks!
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cheers!


Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
This looks like a version conflict; which version of Spark are you using?
The Cassandra connector you are using is built for Scala 2.10 and Spark 1.6.

On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola 
wrote:

> HI List,
>
>
>
> I am launching this spark-submit job:
>
>
>
> hadoop@testbedocg:/mnt/spark> bin/spark-submit --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 --jars
> /mnt/spark/lib/TargetHolding_pyspark-cassandra-0.3.5.jar spark_v2.py
>
>
>
> spark_v2.py is:
>
> from pyspark_cassandra import CassandraSparkContext, Row
>
> from pyspark import SparkContext, SparkConf
>
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("test").setMaster("spark://
> 192.168.23.31:7077").set("spark.cassandra.connection.host",
> "192.168.23.31")
>
> sc = CassandraSparkContext(conf=conf)
>
> table =
> sc.cassandraTable("lebara_diameter_codes","nl_lebara_diameter_codes")
>
> food_count = table.select("errorcode2001").groupBy("errorcode2001").count()
>
> food_count.collect()
>
>
>
>
>
> Error I get when running the above command:
>
>
>
> [Stage 0:>  (0 +
> 3) / 7]16/06/30 10:40:36 ERROR TaskSchedulerImpl: Lost executor 0 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 7) / 7]16/06/30 10:40:40 ERROR TaskSchedulerImpl: Lost executor 1 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 5) / 7]16/06/30 10:40:42 ERROR TaskSchedulerImpl: Lost executor 3 on as5:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> [Stage 0:>  (0 +
> 4) / 7]16/06/30 10:40:46 ERROR TaskSchedulerImpl: Lost executor 4 on as4:
> Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages.
>
> 16/06/30 10:40:46 ERROR TaskSetManager: Task 5 in stage 0.0 failed 4
> times; aborting job
>
> Traceback (most recent call last):
>
>   File "/mnt/spark-1.6.1-bin-hadoop2.6/spark_v2.py", line 11, in 
>
> food_count =
> table.select("errorcode2001").groupBy("errorcode2001").count()
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in
> count
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in
> fold
>
>   File "/mnt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in
> collect
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
> 813, in __call__
>
>   File "/mnt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line
> 308, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 0.0 (TID 14, as4): ExecutorLostFailure (executor 4 exited caused by one of
> the running tasks) Reason: Remote RPC client disassociated. Likely due to
> containers exceeding thresholds, or network issues. Check driver logs for
> WARN messages.
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
>
>at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>
> at scala.Option.foreach(Option.scala:236)
>
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>
> at
> 

Re: Spark Task is not created

2016-06-25 Thread Akhil Das
It would be good if you could paste the piece of code that you are executing.

On Sun, Jun 26, 2016 at 11:21 AM, Ravindra 
wrote:

> Hi All,
>
> May be I need to just set some property or its a known issue. My spark
> application hangs in test environment whenever I see following message -
>
> 16/06/26 11:13:34 INFO DAGScheduler: *Submitting 2 missing tasks from
> ShuffleMapStage* 145 (MapPartitionsRDD[590] at rdd at
> WriteDataFramesDecorator.scala:61)
> 16/06/26 11:13:34 INFO TaskSchedulerImpl: Adding task set 145.0 with 2
> tasks
> 16/06/26 11:13:34 INFO TaskSetManager: Starting task 0.0 in stage 145.0
> (TID 186, localhost, PROCESS_LOCAL, 2389 bytes)
> 16/06/26 11:13:34 INFO Executor: Running task 0.0 in stage 145.0 (TID 186)
> 16/06/26 11:13:34 INFO BlockManager: Found block rdd_575_0 locally
> 16/06/26 11:13:34 INFO GenerateMutableProjection: Code generated in 3.796
> ms
> 16/06/26 11:13:34 INFO Executor: Finished task 0.0 in stage 145.0 (TID
> 186). 2578 bytes result sent to driver
> 16/06/26 11:13:34 INFO TaskSetManager: Finished task 0.0 in stage 145.0
> (TID 186) in 24 ms on localhost (1/2)
>
> It happens with any action. The application works fine whenever I notice
> "Submitting 1 missing tasks from ShuffleMapStage". For this I need to tweak
> the plan, for example using repartition, coalesce etc., but this doesn't
> always help either.
>
> Some of the Spark properties are as given below -
>
> Name                                 Value
> spark.app.id                         local-1466914377931
> spark.app.name                       SparkTest
> spark.cores.max                      3
> spark.default.parallelism            1
> spark.driver.allowMultipleContexts   true
> spark.executor.id                    driver
> spark.externalBlockStore.folderName  spark-050049bd-c058-4035-bc3d-2e73a08e8d0c
> spark.master                         local[2]
> spark.scheduler.mode                 FIFO
> spark.ui.enabled                     true
>
>
> Thanks,
> Ravi.
>
>


-- 
Cheers!


Re: Unable to acquire bytes of memory

2016-06-21 Thread Akhil Das
Looks like this issue https://issues.apache.org/jira/browse/SPARK-10309

On Mon, Jun 20, 2016 at 4:27 PM, pseudo oduesp 
wrote:

> Hi ,
> i don t have no idea why i get this error
>
>
>
> Py4JJavaError: An error occurred while calling o69143.parquet.
> : org.apache.spark.SparkException: Job aborted.
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> at sun.reflect.GeneratedMethodAccessor91.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 8 in stage 10645.0 failed 4 times, most recent failure: Lost
> task 8.3 in stage 10645.0 (TID 536592,
> prssnbd1s006.bigplay.bigdata.intraxa): java.io.IOException: Unable to
> acquire 67108864 bytes of memory
> at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
> at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at org.apache.spark.sql.execution.TungstenSort.org
> $apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:59)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

Re: Unsubscribe

2016-06-21 Thread Akhil Das
You need to send an email to user-unsubscr...@spark.apache.org for
unsubscribing. Read more over here http://spark.apache.org/community.html

On Mon, Jun 20, 2016 at 1:10 PM, Ram Krishna 
wrote:

> Hi Sir,
>
> Please unsubscribe me
>
> --
> Regards,
> Ram Krishna KT
>
>
>
>
>
>


-- 
Cheers!


Re: Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Akhil Das
spark.executor.instances is the parameter that you are looking for. Read
more here http://spark.apache.org/docs/latest/running-on-yarn.html
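
For example (the numbers are only illustrative for a 3-node cluster):

spark-submit --master yarn \
  --num-executors 3 \
  --executor-cores 4 \
  --executor-memory 4g \
  your-app.jar

--num-executors is the command-line equivalent of spark.executor.instances;
alternatively, enable spark.dynamicAllocation.enabled=true and let YARN scale
the executor count for you.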

On Sun, Jun 19, 2016 at 2:17 AM, Natu Lauchande 
wrote:

> Hi,
>
> I am running some Spark loads. I notice that it only uses one of the
> machines of the cluster (instead of the 3 available).
>
> Is there any parameter that can be set to force it to use all the cluster.
>
> I am using AWS EMR with Yarn.
>
>
> Thanks,
> Natu
>
>
>
>
>
>
>


-- 
Cheers!


Re: spark streaming - how to purge old data files in data directory

2016-06-18 Thread Akhil Das
Currently there is no out-of-the-box solution for this, although you can use
other HDFS utilities to remove older files from the directory (say, files more
than 24 hours old). Another approach is discussed here.

On Sun, Jun 19, 2016 at 7:28 AM, Vamsi Krishna 
wrote:

> Hi,
>
> I'm on HDP 2.3.2 cluster (Spark 1.4.1).
> I have a spark streaming app which uses 'textFileStream' to stream simple
> CSV files and process.
> I see the old data files that are processed are left in the data directory.
> What is the right way to purge the old data files in data directory on
> HDFS?
>
> Thanks,
> Vamsi Attluri
> --
> Vamsi Attluri
>



-- 
Cheers!


Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
Spark Streaming does not pick up old files by default. Start your job with
master=local[2] (it needs 2 or more worker threads: 1 to read the files and the
other to do your computation) and, once the job starts running, place your input
files in the input directories; you will then see them being picked up by
Spark Streaming.
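
In other words, something like this, and only drop the files into the
directories after the job is up (the paths are the ones from your mail; the
batch interval is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeans")
val ssc = new StreamingContext(conf, Seconds(10))

// textFileStream only picks up files that appear *after* the stream starts
val trainingData = ssc.textFileStream("D:/spark/streaming example/Data Sets/training")
val testData = ssc.textFileStream("D:/spark/streaming example/Data Sets/test")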

On Sun, Jun 19, 2016 at 12:37 AM, Biplob Biswas <revolutioni...@gmail.com>
wrote:

> Hi,
>
> I tried local[*] and local[2] and the result is the same. I don't really
> understand the problem here.
> How can I confirm that the files are read properly?
>
> Thanks & Regards
> Biplob Biswas
>
> On Sat, Jun 18, 2016 at 5:59 PM, Akhil Das <ak...@hacked.work> wrote:
>
>> Looks like you need to set your master to local[2] or local[*]
>>
>> On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas <revolutioni...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I implemented the streamingKmeans example provided in the spark website
>>> but
>>> in Java.
>>> The full implementation is here,
>>>
>>> http://pastebin.com/CJQfWNvk
>>>
>>> But i am not getting anything in the output except occasional timestamps
>>> like one below:
>>>
>>> ---
>>> Time: 1466176935000 ms
>>> ---
>>>
>>> Also, i have 2 directories:
>>> "D:\spark\streaming example\Data Sets\training"
>>> "D:\spark\streaming example\Data Sets\test"
>>>
>>> and inside these directories i have 1 file each "samplegpsdata_train.txt"
>>> and "samplegpsdata_test.txt" with training data having 500 datapoints and
>>> test data with 60 datapoints.
>>>
>>> I am very new to the spark systems and any help is highly appreciated.
>>>
>>> Thank you so much
>>> Biplob Biswas
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementationof-StreamingKmeans-tp27192.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Cheers!
>>
>>
>


-- 
Cheers!


Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Akhil Das
A screenshot of the Executors tab would explain it better. Usually executors
are allocated when the job is started; if you have a multi-node cluster
then you'll see executors launched on different nodes.

On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski  wrote:

> Hi,
>
> This is for Spark on YARN - a 1-node cluster with Spark 2.0.0-SNAPSHOT
> (today build)
>
> I can understand that when a stage fails a new executor entry shows up
> in web UI under Executors tab (that corresponds to a stage attempt). I
> understand that this is to keep the stdout and stderr logs for future
> reference.
>
> Why are there multiple executor entries under the same executor IDs?
> What are the executor entries exactly? When are the new ones created
> (after a Spark application is launched and assigned the
> --num-executors executors)?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cheers!


Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
Looks like you need to set your master to local[2] or local[*]

On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas 
wrote:

> Hi,
>
> I implemented the streamingKmeans example provided in the spark website but
> in Java.
> The full implementation is here,
>
> http://pastebin.com/CJQfWNvk
>
> But i am not getting anything in the output except occasional timestamps
> like one below:
>
> ---
> Time: 1466176935000 ms
> ---
>
> Also, i have 2 directories:
> "D:\spark\streaming example\Data Sets\training"
> "D:\spark\streaming example\Data Sets\test"
>
> and inside these directories i have 1 file each "samplegpsdata_train.txt"
> and "samplegpsdata_test.txt" with training data having 500 datapoints and
> test data with 60 datapoints.
>
> I am very new to the spark systems and any help is highly appreciated.
>
> Thank you so much
> Biplob Biswas
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementationof-StreamingKmeans-tp27192.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cheers!


Re: Getting NPE when trying to do spark streaming with Twitter

2016-04-11 Thread Akhil Das
Looks like a YARN issue to me. Can you try checking out this code:
https://github.com/akhld/sparkstreaming-twitter
Just git clone and do an sbt run after configuring your credentials in the
main file.

Thanks
Best Regards

On Mon, Apr 11, 2016 at 11:29 AM, krisgari  wrote:

>  I am new to SparkStreaming, when tried to submit the Spark-Twitter
> streaming
> job, getting the following error:
> ---
> Lost task 0.0 in stage 0.0 (TID
> 0,sandbox.hortonworks.com):java.lang.NullPointerException
> at org.apache.spark.util.Utils$.decodeFileNameInURI(Utils.scala:340)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:365)
> at
>
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:404)
> at
>
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:396)
> at
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at
> org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:396)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745
> --
>
> Here is the code snippet:
> --
> val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) =
> args.take(4)
> val filters = args.takeRight(args.length - 4)
>
> System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
> System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
> System.setProperty("twitter4j.oauth.accessToken", accessToken)
> System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
> val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
> val ssc = new StreamingContext(sparkConf,  Seconds(2))
> val stream = TwitterUtils.createStream(ssc,None, filters)
> val hashTags = stream.flatMap(status => status.getText.split("
> ").filter(_.startsWith("#")))
> val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _,
> Seconds(60))
>  .map{case (topic, count) => (count, topic)}
>  .transform(_.sortByKey(false))
> topCounts60.foreachRDD(rdd => {
>   val topList = rdd.take(10)
>   println("\nPopular topics in last 60 seconds (%s
> total):".format(rdd.count()))
>   topList.foreach{case (count, tag) => println("%s (%s tweets)".format(tag,
> count))}
> })
> ssc.start()
> ssc.awaitTermination()
>
> --
> command to submit the job
> --
> ./bin/spark-submit --class
> "org.apache.spark.examples.streaming.TwitterPopularTags"  --master
> yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m
> --executor-cores 1  --jars
>
> /home/spark/.sbt/0.13/staging/0ff3ad537358b61f617c/twitterstream/target/scala-2.10/twitterstream-project_2.10-1.0.jar,/home/spark/.ivy2/cache/org.apache.spark/spark-streaming-twitter_2.10/jars/spark-streaming-twitter_2.10-1.6.1.jar,/home/spark/.ivy2/cache/org.twitter4j/twitter4j-core/jars/twitter4j-core-4.0.4.jar,/home/spark/.ivy2/cache/org.twitter4j/twitter4j-stream/jars/twitter4j-stream-4.0.4.jar,/home/spark/.ivy2/cache/org.apache.spark/spark-streaming_2.10/jars/spark-streaming_2.10-1.6.1.jar,/home/spark/.ivy2/cache/org.apache.spark/spark-core_2.10/jars/spark-core_2.10-1.6.1.jar
> "sandbox.hortonworks.com:6667" xx xx xx xx
> --
>
> Any clue why I am getting this NPE?? Any help on how to debug this further?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Getting-NPE-when-trying-to-do-spark-streaming-with-Twitter-tp26737.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark not handling Null

2016-04-11 Thread Akhil Das
Surround it with a try..catch where it's complaining about the null pointer
to avoid the job failing. What is happening here is that you are returning
null, and the following operation then works on that null, which causes
the job to fail.
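
If you want to skip unparseable records instead of returning null, the usual
trick is to emit zero-or-one results from a flat map; a Scala sketch of the
idea (the Java API has an equivalent flatMapToPair, and the names are the ones
from your snippet):

val parsed = filtered.flatMap { line =>
  try {
    // parse() may return null; wrapping it in Option silently drops those records
    Option(imprDetailMarshal.parse(line)).map { h =>
      (h.getImpressionDetailRecord, h.getTimeCount)
    }
  } catch {
    case _: Exception => None  // bad record: skip it and move on
  }
}
// then reduceByKey / foreachRDD on `parsed` as before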

Thanks
Best Regards

On Mon, Apr 11, 2016 at 12:51 PM, saurabh guru 
wrote:

> Trying to run the following causes a NullPointerException. While I thought
> Spark should have been able to handle null, apparently it is not able to.
> What could I return in place of null? What other ways could I approach
> this? There are times when I would want to just skip parsing and proceed to
> the next record; how do I handle that?
>
>
> filtered.mapToPair(new Parse(imprDetailMarshal)).reduceByKey(new
> TimeAndCount()).foreachRDD(new
> ImpressionDetailLogPub(indexNamePrefix,indexType, imprDetailMarshal));
> }
>
>
>
> *   Where Parse is:*
>
> private static class Parse implements PairFunction<String, ImpressionDetailRecord, TimeNcount> {
>
> private static final long serialVersionUID = -5060508551208900848L;
> private static final DateTimeFormatter FORMATTER =
> DateTimeFormat.forPattern("dd/MMM/:HH:mm:ss ZZ");
> private final ImprDetailMarshal imprDetailMarshal;
>
> public Parse(ImprDetailMarshal imprDetailMarshal){
> this.imprDetailMarshal = imprDetailMarshal;
> }
>
> @Override
> public Tuple2<ImpressionDetailRecord, TimeNcount> call(String
> arg0) throws Exception {
>
>   ImpressionDetailRecordHolder recordHolder =
> imprDetailMarshal.parse(arg0);
> if(recordHolder != null)
> {
> return new Tuple2<ImpressionDetailRecord, TimeNcount>(recordHolder.getImpressionDetailRecord(), recordHolder.getTimeCount());
> }
> return null;
>return null;
> }
> }
>
> ​java.lang.NullPointerException
> at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)​
>
>
>
> --
> Thanks,
> Saurabh
>
> :)
>


Re:

2016-04-04 Thread Akhil Das
1 core with 4 partitions means it executes them one by one, not in parallel.
For the Kafka question, if you don't have a high data volume then you may not
need 40 partitions.

Thanks
Best Regards

On Sat, Apr 2, 2016 at 7:35 PM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Hello,
>
> As per Spark programming guide, it says "we should have 2-4 partitions for
> each CPU in your cluster.". In this case how does 1 CPU core process 2-4
> partitions at the same time?
> Link - http://spark.apache.org/docs/latest/programming-guide.html (under
> Rdd section)
>
> Does it do context switching between tasks or run them in parallel? If it
> does context switching how is it efficient compared to 1:1 partition vs
> Core?
>
> PS: If we are using Kafka direct API  in which kafka partitions=  Rdd
> partitions. Does that mean we should give 40 kafka partitions for 10 CPU
> Cores?
>
> --
>
>
> Regards
> Hemalatha
>


Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
I didn't know you have a parquet file containing JSON data.

Thanks
Best Regards

On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V <ramkumar.c...@gmail.com> wrote:

> Hi Akhil,
>
> Thanks for your help. Why do you put separator as "," ?
>
> I have a parquet file which contains only json in each line.
>
> *Thanks*,
> <https://in.linkedin.com/in/ramkumarcs31>
>
>
> On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Something like this (in scala):
>>
>> val rdd = parquetFile.javaRDD().map(row => row.mkString(","))
>>
>> You can create a map operation over your javaRDD to convert the
>> org.apache.spark.sql.Row
>> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/Row.html>
>> to String (the Row.mkString() operation)
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Apr 4, 2016 at 12:02 PM, Ramkumar V <ramkumar.c...@gmail.com>
>> wrote:
>>
>>> Any idea on this ? How to convert parquet file into JavaRDD ?
>>>
>>> *Thanks*,
>>> <https://in.linkedin.com/in/ramkumarcs31>
>>>
>>>
>>> On Thu, Mar 31, 2016 at 4:30 PM, Ramkumar V <ramkumar.c...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for the reply. I tried this. It's returning JavaRDD<Row>
>>>> instead of JavaRDD<String>. How to get JavaRDD<String>?
>>>>
>>>> Error :
>>>> incompatible types:
>>>> org.apache.spark.api.java.JavaRDD<Row> cannot be
>>>> converted to org.apache.spark.api.java.JavaRDD<String>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Thanks*,
>>>> <https://in.linkedin.com/in/ramkumarcs31>
>>>>
>>>>
>>>> On Thu, Mar 31, 2016 at 2:57 PM, UMESH CHAUDHARY <umesh9...@gmail.com>
>>>> wrote:
>>>>
>>>>> From Spark Documentation:
>>>>>
>>>>> DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
>>>>>
>>>>> JavaRDD<Row> jRDD = parquetFile.javaRDD()
>>>>>
>>>>> javaRDD() method will convert the DF to RDD
>>>>>
>>>>> On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V <ramkumar.c...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to read parquet log files in Java Spark. Parquet log files
>>>>>> are stored in hdfs. I want to read and convert that parquet file into
>>>>>> JavaRDD. I could able to find Sqlcontext dataframe api. How can I read if
>>>>>> it is sparkcontext and rdd ? what is the best way to read it ?
>>>>>>
>>>>>> *Thanks*,
>>>>>> <https://in.linkedin.com/in/ramkumarcs31>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
Something like this (in scala):

val rdd = parquetFile.javaRDD().map(row => row.mkString(","))

You can create a map operation over your javaRDD to convert the
org.apache.spark.sql.Row to a String (the Row.mkString() operation)
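
Put together, a minimal sketch (Spark 1.x SQLContext API; the path is a
placeholder), with the Java 8 equivalent in a comment:

val parquetFile = sqlContext.read.parquet("hdfs:///path/to/logs.parquet")
val lines = parquetFile.rdd.map(row => row.mkString(","))  // RDD[String], one line per row

// Java 8 equivalent:
// JavaRDD<String> lines = parquetFile.javaRDD().map(row -> row.mkString(","));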

Thanks
Best Regards

On Mon, Apr 4, 2016 at 12:02 PM, Ramkumar V  wrote:

> Any idea on this ? How to convert parquet file into JavaRDD ?
>
> *Thanks*,
> 
>
>
> On Thu, Mar 31, 2016 at 4:30 PM, Ramkumar V 
> wrote:
>
>> Hi,
>>
>> Thanks for the reply. I tried this. It's returning JavaRDD<Row> instead
>> of JavaRDD<String>. How to get JavaRDD<String>?
>>
>> Error :
>> incompatible types:
>> org.apache.spark.api.java.JavaRDD<Row> cannot be
>> converted to org.apache.spark.api.java.JavaRDD<String>
>>
>>
>>
>>
>>
>> *Thanks*,
>> 
>>
>>
>> On Thu, Mar 31, 2016 at 2:57 PM, UMESH CHAUDHARY 
>> wrote:
>>
>>> From Spark Documentation:
>>>
>>> DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
>>>
>>> JavaRDD<Row> jRDD = parquetFile.javaRDD()
>>>
>>> javaRDD() method will convert the DF to RDD
>>>
>>> On Thu, Mar 31, 2016 at 2:51 PM, Ramkumar V 
>>> wrote:
>>>
 Hi,

 I'm trying to read parquet log files in Java Spark. Parquet log files
 are stored in hdfs. I want to read and convert that parquet file into
 JavaRDD. I could able to find Sqlcontext dataframe api. How can I read if
 it is sparkcontext and rdd ? what is the best way to read it ?

 *Thanks*,
 


>>>
>>
>


Re: Spark streaming spilling all the data to disk even if memory available

2016-03-31 Thread Akhil Das
Use StorageLevel MEMORY_ONLY. Also have a look at the createDirectStream
API. Most likely your processing time exceeds your batch duration, and the
accumulating delay eventually blows up the memory.
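
A sketch of the direct approach together with MEMORY_ONLY (spark-streaming-kafka
0.8 API for Spark 1.x; the broker address and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("your-topic"))

// The direct stream has no receiver, so there is nothing to persist with
// MEMORY_AND_DISK_SER; if you stick with createStream, pass StorageLevel.MEMORY_ONLY instead.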
On Mar 31, 2016 6:13 PM, "Mayur Mohite" <mayur.moh...@applift.com> wrote:

> We are using KafkaUtils.createStream API to read data from kafka topics
> and we are using StorageLevel.MEMORY_AND_DISK_SER option while configuring
> kafka streams.
>
> On Wed, Mar 30, 2016 at 12:58 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Can you elaborate more on from where you are streaming the data and what
>> type of consumer you are using etc?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Mar 29, 2016 at 6:10 PM, Mayur Mohite <mayur.moh...@applift.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We are running spark streaming app on a single machine and we have
>>> configured spark executor memory to 30G.
>>> We noticed that after running the app for 12 hours, spark streaming
>>> started spilling ALL the data to disk even though we have configured
>>> sufficient memory for spark to use for storage.
>>>
>>> -Mayur
>>>
>>> Learn more about our inaugural *FirstScreen Conference
>>> <http://www.firstscreenconf.com/>*!
>>> *Where the worlds of mobile advertising and technology meet!*
>>>
>>> June 15, 2016 @ Urania Berlin
>>>
>>
>>
>
>
> --
> *Mayur Mohite*
> Senior Software Engineer
>
> Phone: +91 9035867742
> Skype: mayur.mohite_applift
>
>
> *AppLift India*
> 107/3, 80 Feet Main Road,
> Koramangala 4th Block,
> Bangalore - 560034
> www.AppLift.com <http://www.applift.com/>
>
>
> Learn more about our inaugural *FirstScreen Conference
> <http://www.firstscreenconf.com/>*!
> *Where the worlds of mobile advertising and technology meet!*
>
> June 15, 2016 @ Urania Berlin
>


Re: Spark streaming spilling all the data to disk even if memory available

2016-03-30 Thread Akhil Das
Can you elaborate more on from where you are streaming the data and what
type of consumer you are using etc?

Thanks
Best Regards

On Tue, Mar 29, 2016 at 6:10 PM, Mayur Mohite 
wrote:

> Hi,
>
> We are running spark streaming app on a single machine and we have
> configured spark executor memory to 30G.
> We noticed that after running the app for 12 hours, spark streaming
> started spilling ALL the data to disk even though we have configured
> sufficient memory for spark to use for storage.
>
> -Mayur
>
> Learn more about our inaugural *FirstScreen Conference
> *!
> *Where the worlds of mobile advertising and technology meet!*
>
> June 15, 2016 @ Urania Berlin
>


Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Akhil Das
In your case, you will be able to see the webui (unless restricted with
iptables) but you won't be able to submit jobs to that machine from a
remote machine since the spark master is spark://127.0.0.1:7077

Thanks
Best Regards

On Tue, Mar 29, 2016 at 8:12 PM, David O'Gwynn  wrote:

> /etc/hosts
>
> 127.0.0.1 localhost
>
> conf/slaves
> 127.0.0.1
>
>
> On Mon, Mar 28, 2016 at 5:36 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> in your /etc/hosts what do you have for localhost
>>
>> 127.0.0.1 localhost.localdomain localhost
>>
>> conf/slave should have one entry in your case
>>
>> cat slaves
>> # A Spark Worker will be started on each of the machines listed below.
>> localhost
>> ...
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 March 2016 at 15:32, David O'Gwynn  wrote:
>>
>>> Greetings to all,
>>>
>>> I've searched around the mailing list, but it would seem that (nearly?)
>>> everyone has the opposite problem as mine. I made a stab at looking in the
>>> source for an answer, but I figured I might as well see if anyone else has
>>> run into the same problem as I.
>>>
>>> I'm trying to limit my Master/Worker UI to run only on localhost. As it
>>> stands, I have the following two environment variables set in my
>>> spark-env.sh:
>>>
>>> SPARK_LOCAL_IP=127.0.0.1
>>> SPARK_MASTER_IP=127.0.0.1
>>>
>>> and my slaves file contains one line: 127.0.0.1
>>>
>>> The problem is that when I run "start-all.sh", I can nmap my box's
>>> public interface and get the following:
>>>
>>> PORT STATE SERVICE
>>> 22/tcp   open  ssh
>>> 8080/tcp open  http-proxy
>>> 8081/tcp open  blackice-icecap
>>>
>>> Furthermore, I can go to my box's public IP at port 8080 in my browser
>>> and get the master node's UI. The UI even reports that the URL/REST URLs to
>>> be 127.0.0.1:
>>>
>>> Spark Master at spark://127.0.0.1:7077
>>> URL: spark://127.0.0.1:7077
>>> REST URL: spark://127.0.0.1:6066 (cluster mode)
>>>
>>> I'd rather not have spark available in any way to the outside world
>>> without an explicit SSH tunnel.
>>>
>>> There are variables to do with setting the Web UI port, but I'm not
>>> concerned with the port, only the network interface to which the Web UI
>>> binds.
>>>
>>> Any help would be greatly appreciated.
>>>
>>>
>>
>


Re: Master options Cluster/Client descrepencies.

2016-03-30 Thread Akhil Das
Have a look at
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

Thanks
Best Regards

On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

>
> Hi All,
>
> I have written a spark program on my dev box ,
>IDE:Intellij
>scala version:2.11.7
>spark verison:1.6.1
>
> run fine from IDE, by providing proper input and output paths including
>  master.
>
> But when i try to deploy the code in my cluster made of below,
>
>Spark version:1.6.1
> built from source pkg using scala 2.11
> But when I try spark-shell on the cluster I get the scala version to be
> 2.10.5
>  hadoop yarn cluster 2.6.0
>
> and with additional options,
>
> --executor-memory
> --total-executor-cores
> --deploy-mode cluster/client
> --master yarn
>
> i get Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at com.movoto.SparkPost$.main(SparkPost.scala:36)
> at com.movoto.SparkPost.main(SparkPost.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> I understand this to be a scala version issue, as I have faced this before.
>
> Is there something that I have to change to get the same
> program running on the cluster?
>
> Regards,
> Satyajit.
>
>


Re: aggregateByKey on PairRDD

2016-03-30 Thread Akhil Das
Isn't it what tempRDD.groupByKey does?
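
A quick sketch of both options, using the sample rows from the question:

// Sample data from the question
val tempRDD = sc.parallelize(Seq(
  ("amazon", ("book1", "tech")),
  ("eBay",   ("book1", "tech")),
  ("barns",  ("book",  "tech")),
  ("amazon", ("book2", "tech"))))

// Option 1: groupByKey, then turn the resulting iterable into a List
val grouped = tempRDD.groupByKey().mapValues(_.toList)

// Option 2: aggregateByKey building the List directly (avoids string concatenation)
val aggregated = tempRDD.aggregateByKey(List.empty[(String, String)])(
  (acc, value) => value :: acc,
  (left, right) => left ++ right)

grouped.collect().foreach(println)   // e.g. (amazon,List((book1,tech), (book2,tech)))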

Thanks
Best Regards

On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh 
wrote:

> Hi All,
>
> I have an RDD having the data in  the following form :
>
> tempRDD: RDD[(String, (String, String))]
>
> (brand , (product, key))
>
> ("amazon",("book1","tech"))
>
> ("eBay",("book1","tech"))
>
> ("barns",("book","tech"))
>
> ("amazon",("book2","tech"))
>
>
> I would like to group the data by Brand and would like to get the result
> set in the following format :
>
> resultSetRDD : RDD[(String, List[(String, String)])]
>
> I tried using aggregateByKey but am not quite getting how to achieve
> this. Or is there any other way to achieve this?
>
> val resultSetRDD  = tempRDD.aggregateByKey("")({case (aggr , value) =>
> aggr + String.valueOf(value) + ","}, (aggr1, aggr2) => aggr1 + aggr2)
>
> resultSetRDD = (amazon,("book1","tech"),("book2","tech"))
>
> Thanks,
>
> Suniti
>


Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Akhil Das
Looks like the winutils.exe is missing from the environment, See
https://issues.apache.org/jira/browse/SPARK-2356

Thanks
Best Regards

On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman  wrote:

> Hi,
>
> I am using the Spark 1.6.0 prebuilt for Hadoop 2.6.0 on my Windows machine.
>
> I was trying to use the Databricks CSV format to read a CSV file. I used the
> command below.
>
> [image: Inline image 1]
>
> I got null pointer exception. Any help would be greatly appreciated.
>
> [image: Inline image 2]
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Issue with wholeTextFiles

2016-03-22 Thread Akhil Das
Can you paste the exception stack here?

Thanks
Best Regards

On Mon, Mar 21, 2016 at 1:42 PM, Sarath Chandra <
sarathchandra.jos...@algofusiontech.com> wrote:

> I'm using Hadoop 1.0.4 and Spark 1.2.0.
>
> I'm facing a strange issue. I have a requirement to read a small file from
> HDFS and all its content has to be read in one shot. So I'm using the spark
> context's wholeTextFiles API passing the HDFS URL for the file.
>
> When I try this from a spark shell it works as mentioned in the
> documentation, but when I try the same through a program (by submitting the job
> to the cluster) I get FileNotFoundException. I have all compatible JARs in
> place.
>
> Please help.
>
>
>


Re: pyspark sql convert long to timestamp?

2016-03-22 Thread Akhil Das
Have a look at the from_unixtime() functions.
https://spark.apache.org/docs/1.5.0/api/python/_modules/pyspark/sql/functions.html#from_unixtime

Thanks
Best Regards

On Tue, Mar 22, 2016 at 4:49 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Any idea how I have a col in a data frame that is of type long any idea
> how I create a column who’s type is time stamp?
>
> The long is unix epoch in ms
>
> Thanks
>
> Andy
>


Re: Setting up spark to run on two nodes

2016-03-21 Thread Akhil Das
You can simply execute the sbin/start-slaves.sh file to start up all slave
processes. Just make sure you have spark installed on the same path on all
the machines.

Thanks
Best Regards

On Sat, Mar 19, 2016 at 4:01 AM, Ashok Kumar 
wrote:

> Experts.
>
> Please your valued advice.
>
> I have spark 1.5.2 set up as standalone for now and I have started the
> master as below
>
> start-master.sh
>
> I also have modified config/slave file to have
>
> # A Spark Worker will be started on each of the machines listed below.
> localhost
> workerhost
>
>
> On the localhost I start slave as follows:
>
> start-slave.sh spark:localhost:7077
>
> Questions.
>
> If I want worker process to be started not only on localhost but also
> workerhost
>
> 1) Do I need just to do start-slave.sh on localhost and it will start the
> worker process on other node -> workerhost
> 2) Do I have to runt start-slave.sh spark:workerhost:7077 as well locally
> on workerhost
> 3) On the GUI http://localhost:4040/environment/
> I do not see any reference to worker process running on workerhost
>
> Appreciate any help on how to go about starting the master on localhost
> and starting two workers one on localhost and the other on workerhost
>
> Thanking you
>
>


Re: Potential conflict with org.iq80.snappy in Spark 1.6.0 environment?

2016-03-21 Thread Akhil Das
Looks like a jar conflict. Could you paste the piece of code and show what your
dependency file looks like?

Thanks
Best Regards

On Sat, Mar 19, 2016 at 7:49 AM, vasu20  wrote:

> Hi,
>
> I have some code that parses a snappy thrift file for objects.  This code
> works fine when run standalone (outside of the Spark environment).
> However,
> when running from within Spark, I get an IllegalAccessError exception from
> the org.iq80.snappy package.  Has anyone else seen this error and/or do you
> have any suggestions?  Any pointers appreciated.  Thanks!
>
> Vasu
>
> --
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
> Lost task 0.0 in stage 0.0 (TID 0, localhost):
> java.lang.IllegalAccessError:
> tried to access class org.iq80.snappy.BufferRecycler from class
> org.iq80.snappy.AbstractSnappyInputStream
> at
>
> org.iq80.snappy.AbstractSnappyInputStream.(AbstractSnappyInputStream.java:91)
> at
>
> org.iq80.snappy.SnappyFramedInputStream.(SnappyFramedInputStream.java:38)
> at DistMatchMetric$1.call(DistMatchMetric.java:131)
> at DistMatchMetric$1.call(DistMatchMetric.java:123)
> at
>
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
>
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
> at
> scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$14.apply(RDD.scala:1011)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$14.apply(RDD.scala:1009)
> at
> org.apache.spark.SparkContext$$anonfun$36.apply(SparkContext.scala:1951)
> at
> org.apache.spark.SparkContext$$anonfun$36.apply(SparkContext.scala:1951)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Potential-conflict-with-org-iq80-snappy-in-Spark-1-6-0-environment-tp26539.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Building spark submodule source code

2016-03-21 Thread Akhil Das
Have a look at the intellij setup
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
Once you have the setup ready, you don't have to recompile the whole stuff
every time.

Thanks
Best Regards

On Mon, Mar 21, 2016 at 8:14 AM, Tenghuan He  wrote:

> Hi everyone,
>
> I am trying to add a new method to spark RDD. After changing the code
> of RDD.scala and running the following command
> mvn -pl :spark-core_2.10 -DskipTests clean install
> The build reports BUILD SUCCESS; however, when starting bin\spark-shell, my
> method cannot be found.
> Do I have to rebuild the whole spark project instead of the spark-core
> submodule to make the changes work?
> Rebuilding the whole project is too time consuming, is there any better
> choice?
>
>
> Thanks & Best Regards
>
> Tenghuan He
>
>


Re: Error using collectAsMap() in scala

2016-03-21 Thread Akhil Das
What you should be doing is a join, something like this:

//Create a key, value pair, key being the column1
val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0), x.split(",")))

//Create a key, value pair, key being the column2
val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1), x.split(",")))

//Now join the dataset
val joined = rdd1.join(rdd2)

//Now do the replacement
val replaced = joined.map(...)





Thanks
Best Regards

On Mon, Mar 21, 2016 at 10:31 AM, Shishir Anshuman <
shishiranshu...@gmail.com> wrote:

> I have stored the contents of two csv files in separate RDDs.
>
> file1.csv format*: (column1,column2,column3)*
> file2.csv format*: (column1, column2)*
>
> *column1 of file1 *and* column2 of file2 *contains similar data. I want
> to compare the two columns and if match is found:
>
>- Replace the data at *column1(file1)* with the* column1(file2)*
>
>
> For this reason, I am not using normal RDD.
>
> I am still new to apache spark, so any suggestion will be greatly
> appreciated.
>
> On Mon, Mar 21, 2016 at 10:09 AM, Prem Sure  wrote:
>
>> any specific reason you would like to use collectasmap only? You probably
>> move to normal RDD instead of a Pair.
>>
>>
>> On Monday, March 21, 2016, Mark Hamstra  wrote:
>>
>>> You're not getting what Ted is telling you.  Your `dict` is an
>>> RDD[String]  -- i.e. it is a collection of a single value type, String.
>>> But `collectAsMap` is only defined for PairRDDs that have key-value pairs
>>> for their data elements.  Both a key and a value are needed to collect into
>>> a Map[K, V].
>>>
>>> On Sun, Mar 20, 2016 at 8:19 PM, Shishir Anshuman <
>>> shishiranshu...@gmail.com> wrote:
>>>
 yes I have included that class in my code.
 I guess its something to do with the RDD format. Not able to figure out
 the exact reason.

 On Fri, Mar 18, 2016 at 9:27 AM, Ted Yu  wrote:

> It is defined in:
> core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
>
> On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman <
> shishiranshu...@gmail.com> wrote:
>
>> I am using following code snippet in scala:
>>
>>
>> *val dict: RDD[String] = sc.textFile("path/to/csv/file")*
>> *val dict_broadcast=sc.broadcast(dict.collectAsMap())*
>>
>> On compiling It generates this error:
>>
>> *scala:42: value collectAsMap is not a member of
>> org.apache.spark.rdd.RDD[String]*
>>
>>
>> *val dict_broadcast=sc.broadcast(dict.collectAsMap())
>> ^*
>>
>
>

>>>
>


Re: unsubscribe

2016-03-15 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read
more over here http://spark.apache.org/community.html


Thanks
Best Regards

On Tue, Mar 15, 2016 at 1:28 PM, Netwaver  wrote:

> unsubscribe
>
>
>
>


Re: unsubscribe

2016-03-15 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read
more over here http://spark.apache.org/community.html

Thanks
Best Regards

On Tue, Mar 15, 2016 at 12:56 PM, satish chandra j  wrote:

> unsubscribe
>


Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Akhil Das
You can achieve this the normal RDD way. Have one extra stage in the
pipeline where you properly standardize all the values (like replacing
doc with doctor) for all the columns before the join.
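
A rough sketch of that extra stage, with a hypothetical synonym map and toy stand-ins for the two tables:

// Hypothetical synonym map; in practice it could be loaded from a file and broadcast.
val synonyms = Map("doc" -> "doctor", "dr" -> "doctor")
def normalize(title: String): String = synonyms.getOrElse(title.toLowerCase, title.toLowerCase)

// Toy stand-ins for the two tables, keyed by the title column.
val table1 = sc.parallelize(Seq(("doctor", "row-a"), ("nurse", "row-b")))
val table2 = sc.parallelize(Seq(("doc", "row-x"), ("nurse", "row-y")))

// Standardize the key on both sides, then join: "doctor" and "doc" now meet on the same key.
val joined = table1.map { case (t, v) => (normalize(t), v) }
  .join(table2.map { case (t, v) => (normalize(t), v) })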

Thanks
Best Regards

On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh 
wrote:

> Hi All,
>
> I have two tables with same schema but different data. I have to join the
> tables based on one column and then do a group by the same column name.
>
> now the data in that column in two table might/might not exactly match.
> (Ex - column name is "title". Table1. title = "doctor"   and Table2. title
> = "doc") doctor and doc are actually same titles.
>
> From performance point of view where i have data volume in TB , i am not
> sure if i can achieve this using the sql statement. What would be the best
> approach of solving this problem. Should i look for MLLIB apis?
>
> Spark Gurus any pointers?
>
> Thanks,
> Suniti
>
>
>


Re: create hive context in spark application

2016-03-15 Thread Akhil Das
Did you try submitting your application with spark-submit? You can
also try opening a spark-shell and see if it picks up your hive-site.xml.

Thanks
Best Regards

On Tue, Mar 15, 2016 at 11:58 AM, antoniosi  wrote:

> Hi,
>
> I am trying to connect to a hive metastore deployed in a oracle db. I have
> the hive configuration
> specified in the hive-site.xml. I put the hive-site.xml under
> $SPARK_HOME/conf. If I run spark-shell,
> everything works fine. I can create hive database, tables and query the
> tables.
>
> However, when I try to do that in a spark application, running in local
> mode, i.e., I have
> sparkConf.setMaster("local[*]").setSparkHome(),
> it does not seem
> to pick up the hive-site.xml. It still uses the local derby Hive metastore
> instead of the oracle
> metastore that I defined in hive-site.xml. If I add the hive-site.xml
> explicitly on the classpath, I am
> getting the following error:
>
> Caused by:
> org.datanucleus.api.jdo.exceptions.TransactionNotActiveException:
> Transaction is not active. You either need to define a transaction around
> this, or run your PersistenceManagerFactory with 'NontransactionalRead' and
> 'NontransactionalWrite' set to 'true'
> FailedObject:org.datanucleus.exceptions.TransactionNotActiveException:
> Transaction is not active. You either need to define a transaction around
> this, or run your PersistenceManagerFactory with 'NontransactionalRead' and
> 'NontransactionalWrite' set to 'true'
> at
>
> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:396)
> at
> org.datanucleus.api.jdo.JDOTransaction.rollback(JDOTransaction.java:186)
> at
>
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.runTestQuery(MetaStoreDirectSql.java:204)
> at
>
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:137)
> at
>
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:295)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
> at
>
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
> at
>
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
> at
>
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:624)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
> at
>
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
> at
>
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> at
>
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199)
> at
>
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>
> This happens when I try to new a HiveContext in my code.
>
> How do I ask Spark to look at the hive-site.xml in the $SPARK_HOME/conf
> directory in my spark application?
>
> Thanks very much. Any pointer will be much appreciated.
>
> Regards,
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/create-hive-context-in-spark-application-tp26496.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Akhil Das
One way would be to keep it this way:

val stream1 = KafkaUtils.createStream(..) // for topic 1

val stream2 = KafkaUtils.createStream(..) // for topic 2


And you will know which stream belongs to which topic.

Another approach which you can put in your code itself would be to tag the
topic name along with the stream that you are creating. Like, create a
tuple(topic, stream) and you will be able to access ._1 as topic and ._2 as
the stream.
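
A rough sketch of the second approach (tagging each stream with its topic); the ZooKeeper quorum, group id and topic names are placeholders, and ssc is your StreamingContext:

import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Seq("topic-a", "topic-b")

// Pair each topic with its own receiver stream so the tag travels with the stream.
val taggedStreams = topics.map { topic =>
  (topic, KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map(topic -> 1)).map(_._2))
}

// Different downstream processing per topic, chosen by the tag.
taggedStreams.foreach { case (topic, stream) =>
  stream.foreachRDD(rdd => println(s"Batch from $topic has ${rdd.count()} records"))
}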


Thanks
Best Regards

On Tue, Mar 15, 2016 at 12:05 PM, Imre Nagi  wrote:

> Hi,
>
> I'm just trying to create a spark streaming application that consumes more
> than one topics sent by kafka. Then, I want to do different further
> processing for data sent by each topic.
>
> val kafkaStreams = {
>>   val kafkaParameter = for (consumerGroup <- consumerGroups) yield {
>> Map(
>>   "metadata.broker.list" -> ConsumerConfig.metadataBrokerList,
>>   "zookeeper.connect" -> ConsumerConfig.zookeeperConnect,
>>   "group.id" -> consumerGroup,
>>   "zookeeper.connection.timeout.ms" ->
>> ConsumerConfig.zookeeperConnectionTimeout,
>>   "schema.registry.url" -> ConsumerConfig.schemaRegistryUrl,
>>   "auto.offset.reset" -> ConsumerConfig.autoOffsetReset
>> )
>>   }
>>   val streams = (0 to kafkaParameter.length - 1) map { p =>
>> KafkaUtils.createStream[String, Array[Byte], StringDecoder,
>> DefaultDecoder](
>>   ssc,
>>   kafkaParameter(p),
>>   Map(topicsArr(p) -> 1),
>>   StorageLevel.MEMORY_ONLY_SER
>> ).map(_._2)
>>   }
>>   val unifiedStream = ssc.union(streams)
>>   unifiedStream.repartition(1)
>> }
>> kafkaStreams.foreachRDD(rdd => {
>>   rdd.foreachPartition(partitionOfRecords => {
>> partitionOfRecords.foreach ( x =>
>>   println(x)
>> )
>>   })
>> })
>
>
> So far, I'm able to get the data from several topic. However, I'm still
> unable to
> differentiate the data sent from a topic with another.
>
> Do anybody has an experience in doing this stuff?
>
> Best,
> Imre
>


Can someone fix this download URL?

2016-03-13 Thread Akhil Das
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

[image: Inline image 1]

There's a broken link for Spark 1.6.1 prebuilt hadoop 2.6 direct download.


Thanks
Best Regards


Re: Sample project on Image Processing

2016-02-22 Thread Akhil Das
What type of Image processing are you doing? Here's a simple example with
Tensorflow
https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html

Thanks
Best Regards

On Mon, Feb 22, 2016 at 1:53 PM, Mishra, Abhishek  wrote:

> Hello,
>
> I am working on image processing samples. Was wondering if anyone has
> worked on Image processing project in spark. Please let me know if any
> sample project or example is available.
>
>
>
> Please guide in this.
>
> Sincerely,
>
> Abhishek
>


Re: [Example] : read custom schema from file

2016-02-22 Thread Akhil Das
If you are talking about a CSV kind of file, then here's an example
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
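
If the file is a plain CSV, a minimal sketch of the reflection approach (the path and the name,age column layout are assumptions):

// Case class describes the schema; Spark infers it by reflection.
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.textFile("hdfs:///data/people.csv")   // assumed layout: name,age
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

df.printSchema()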

Thanks
Best Regards

On Mon, Feb 22, 2016 at 1:10 PM, Divya Gehlot 
wrote:

> Hi,
> Can anybody help me by providing  me example how can we read schema of the
> data set from the file.
>
>
>
> Thanks,
> Divya
>


Re: How to start spark streaming application with recent past timestamp for replay of old batches?

2016-02-21 Thread Akhil Das
On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> Hi Folks,
>
>
>
> I am exploring spark for streaming from two sources (a) Kinesis and (b)
> HDFS for some of our use-cases. Since we maintain state gathered over last
> x hours in spark streaming, we would like to replay the data from last x
> hours as batches during deployment. I have gone through the Spark APIs but
> could not find anything that initiates with older timestamp. Appreciate
> your input on the same.
>
> 1.  Restarting with check-pointing runs the batches faster for missed
> timestamp period, but when we upgrade with new code, the same checkpoint
> directory cannot be reused.
>
=> It is true that you won't be able to use the checkpoint when you
upgrade your code, but production code is not upgraded every now and
then. You can basically create a configuration file in which you put
most of the settings (like streaming duration, parameters etc.) instead of
updating them in the code and breaking the checkpoint.


> 2.  For the case with kinesis as source, we can change the last
> checked sequence number in DynamoDB to get the data from last x hours, but
> this will be one large bunch of data for first restarted batch. So the data
> is not processed as natural multiple batches inside spark.
>
=> The Kinesis API has a way to limit the data rate; you might want to
look into that and implement a custom receiver for your use-case.

http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html

> 3.  For the source from HDFS, I could not find any alternative to
> start the streaming from old timestamp data, unless I manually (or with
> script) rename the old files after starting the stream. (This workaround
> leads to other complications too further).
>
=> What you can do is, once the files are processed, move them to
a different directory; when you restart the stream for whatever reason,
you can make it pick up all the files instead of only the latest ones (by passing
the *newFilesOnly* boolean param).
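
A rough sketch of (3), using the fileStream overload that exposes the newFilesOnly flag; ssc and the directory are placeholders:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newFilesOnly = false makes the stream also pick up files already present in the directory
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///data/incoming", (path: Path) => true, newFilesOnly = false).map(_._2.toString)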


> 4.  May I know how zero data loss is achieved while having
> hdfs as source? i.e. if the driver fails while processing a micro batch,
> what happens when the application is restarted? Is the same micro-batch
> reprocessed?
>
=> Yes, if the application is restarted then the micro-batch will be
reprocessed.


>
>
> Regards
>
> Ashok
>


Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Akhil Das
According to the documentation
<https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>,
the hostname that you are seeing for those properties is inherited
from *yarn.nodemanager.hostname*.
If your requirement is just to see the logs, then you can ssh-tunnel to
the machine (ssh -L 8042:127.0.0.1:8042 mastermachine).

Thanks
Best Regards

On Mon, Feb 15, 2016 at 2:29 PM, Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Yes you are correct Akhil
> But I unable to find that property
> Please find attached screen shots for more details.
>
> Thanks,
> Divya
>
>
>
> On 15 February 2016 at 16:37, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> You can set *yarn.nodemanager.webapp.address* in the
>> yarn-site.xml/yarn-default.xml file to change it I guess.
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>> > Hi,
>> > I have hadoop cluster set up in EC2.
>> > I am unable to view application logs in Web UI as its taking internal IP
>> > Like below :
>> > http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042
>> > <http://ip-172-31-22-136.ap-southeast-1.compute.internal:8042/>
>> >
>> > How can I change this to external one or redirecting to external ?
>> > Attached screenshots for better understanding of my issue.
>> >
>> > Would really appreciate help.
>> >
>> >
>> > Thanks,
>> > Divya
>> >
>> >
>> >
>>
>
>


Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Akhil Das
You can set *yarn.nodemanager.webapp.address* in the
yarn-site.xml/yarn-default.xml file to change it I guess.

Thanks
Best Regards

On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot 
wrote:

> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in Web UI as its taking internal IP
> Like below :
> http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042
> 
>
> How can I change this to external one or redirecting to external ?
> Attached screenshots for better understanding of my issue.
>
> Would really appreciate help.
>
>
> Thanks,
> Divya
>
>
>


Re: spark-streaming with checkpointing: error with sparkOnHBase lib

2016-01-27 Thread Akhil Das
Were you able to resolve this? It'd be good if you can paste the code
snippet to reproduce this.

Thanks
Best Regards

On Fri, Jan 22, 2016 at 2:06 PM, vinay gupta 
wrote:

> Hi,
>   I have a spark-streaming application which uses sparkOnHBase lib to do
> streamBulkPut()
>
> Without checkpointing everything works fine.. But recently upon enabling
> checkpointing I got the
> following exception -
>
> 16/01/22 01:32:35 ERROR executor.Executor: Exception in task 0.0 in stage
> 39.0 (TID 134)
> java.lang.ClassCastException: [B cannot be cast to
> org.apache.spark.SerializableWritable
> at
> com.cloudera.spark.hbase.HBaseContext.applyCreds(HBaseContext.scala:225)
> at com.cloudera.spark.hbase.HBaseContext.com
> $cloudera$spark$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:633)
> at
> com.cloudera.spark.hbase.HBaseContext$$anonfun$com$cloudera$spark$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:460)
> at
> com.cloudera.spark.hbase.HBaseContext$$anonfun$com$cloudera$spark$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:460)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Any pointers from previous users of sparkOnHbase lib ??
>
> Thanks,
> -Vinay
>
>


Re: MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-27 Thread Akhil Das
Did you try enabling spark.memory.useLegacyMode and
upping spark.storage.memoryFraction?
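
For reference, a sketch of those two settings on a SparkConf (the values are only examples):

val conf = new org.apache.spark.SparkConf()
  .setAppName("broadcast-heavy-job")
  // Spark 1.6+: fall back to the pre-1.6 static memory manager
  .set("spark.memory.useLegacyMode", "true")
  // fraction of executor heap reserved for cached/broadcast blocks (0.6 by default in legacy mode)
  .set("spark.storage.memoryFraction", "0.7")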

Thanks
Best Regards

On Fri, Jan 22, 2016 at 3:40 AM, Arun Luthra  wrote:

> WARN MemoryStore: Not enough space to cache broadcast_4 in memory!
> (computed 60.2 MB so far)
> WARN MemoryStore: Persisting block broadcast_4 to disk instead.
>
>
> Can I increase the memory allocation for broadcast variables?
>
> I have a few broadcast variables that I create with sc.broadcast() . Are
> these labeled starting from 0 or from 1 (in reference to "broadcast_N")? I
> want to debug/track down which one is offending.
>
> As a feature request, it would be good if there were an optional argument
> (or perhaps a requireed argument) added to sc.broadcast() so that we could
> give it an internal label. Then it would work the same as the
> sc.accumulator() "name" argument. It would enable more useful warn/error
> messages.
>
> Arun
>


Re: Generate Amplab queries set

2016-01-27 Thread Akhil Das
Have a look at the TPC-H queries; I found this repository with the queries:
https://github.com/ssavvides/tpch-spark

Thanks
Best Regards

On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa 
wrote:

> Hi,
> I have downloaded the Amplab benchmark dataset from
> s3n://big-data-benchmark/pavlo/text/tiny, but I don't know how to generate
> a
> set of random mixed queries of different types like scan,aggregate and
> join.
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Generate-Amplab-queries-set-tp16071.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread Akhil Das
Why would you want to print all rows? You can try the following:

sqlContext.sql("select day_time from my_table limit
10").collect().foreach(println)



Thanks
Best Regards

On Sun, Jan 24, 2016 at 5:58 PM, Eli Super  wrote:

> Unfortunately I am still getting an error when using .show() with `false` or `False`
> or `FALSE`
>
> Py4JError: An error occurred while calling o153.showString. Trace:
> py4j.Py4JException: Method showString([class java.lang.String, class 
> java.lang.Boolean]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
>
>
> On Thu, Jan 21, 2016 at 4:54 PM, Spencer, Alex (Santander) <
> alex.spen...@santander.co.uk> wrote:
>
>> I forgot to add this is (I think) from 1.5.0.
>>
>>
>>
>> And yeah that looks like a Python – I’m not hot with Python but it may be
>> capitalised as False or FALSE?
>>
>>
>>
>>
>>
>> *From:* Eli Super [mailto:eli.su...@gmail.com]
>> *Sent:* 21 January 2016 14:48
>> *To:* Spencer, Alex (Santander)
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark SQL . How to enlarge output rows ?
>>
>>
>>
>> Thanks Alex
>>
>>
>>
>> I get NameError
>>
>> NameError: name 'false' is not defined
>>
>> Is it because of PySpark ?
>>
>>
>>
>>
>>
>> On Thu, Jan 14, 2016 at 3:34 PM, Spencer, Alex (Santander) <
>> alex.spen...@santander.co.uk> wrote:
>>
>> Hi,
>>
>>
>>
>> Try …..show(*false*)
>>
>>
>>
>> public void show(int numRows,
>>
>> boolean truncate)
>>
>>
>>
>>
>>
>> Kind Regards,
>>
>> Alex.
>>
>>
>>
>> *From:* Eli Super [mailto:eli.su...@gmail.com]
>> *Sent:* 14 January 2016 13:09
>> *To:* user@spark.apache.org
>> *Subject:* Spark SQL . How to enlarge output rows ?
>>
>>
>>
>>
>>
>> Hi
>>
>>
>>
>> After executing sql
>>
>>
>>
>> sqlContext.sql("select day_time from my_table limit 10").show()
>>
>>
>>
>> my output looks like  :
>>
>> ++
>>
>> |  day_time|
>>
>> ++
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:53:...|
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:51:...|
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:52:...|
>>
>> |2015/12/15 15:53:...|
>>
>> ++
>>
>>
>>
>> I'd like to get full rows
>>
>> Thanks !
>>

Re: spark streaming input rate strange

2016-01-27 Thread Akhil Das
How are you verifying that data is being dropped? Can you send 10k, 20k events and
write the same to an output location from spark streaming and verify it? If
you find a data mismatch then it's a problem with your
MulticastSocket implementation.
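
One cheap way to cross-check, assuming `stream` is the receiver DStream and the output path is a placeholder:

// Count the records per batch and also persist the counts,
// so they can be compared with what the simulation actually sent.
stream.count().print()
stream.count().saveAsTextFiles("hdfs:///tmp/batch-counts")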

Thanks
Best Regards

On Fri, Jan 22, 2016 at 5:44 PM, patcharee 
wrote:

> Hi,
>
> I have a streaming application with
> - 1 sec interval
> - accept data from a simulation through MulticastSocket
>
> The simulation sent out data using multiple clients/threads every 1 sec
> interval. The input rate accepted by the streaming looks strange.
> - When clients = 10,000 the event rate raises up to 10,000, stays at
> 10,000 a while and drops to about 7000-8000.
> - When clients = 20,000 the event rate raises up to 20,000, stays at
> 20,000 a while and drops to about 15000-17000. The same pattern
>
> Processing time is just about 400 ms.
>
> Any ideas/suggestions?
>
> Thanks,
> Patcharee
>


Re: How to send a file to database using spark streaming

2016-01-27 Thread Akhil Das
This is a good start
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
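
The gist of that guide is roughly the following sketch; the keyspace, table, case class and the extractId helper are all hypothetical, and spark.cassandra.connection.host is assumed to be set on the SparkConf:

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

// Hypothetical target table: CREATE TABLE my_keyspace.my_table (id text PRIMARY KEY, payload text)
case class Record(id: String, payload: String)

// Hypothetical stand-in for real XML parsing
def extractId(xml: String): String = xml.take(16)

// xmlStream: DStream[String] carrying the XML documents read by the streaming source
val records = xmlStream.map(xml => Record(extractId(xml), xml))
records.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "payload"))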

Thanks
Best Regards

On Sat, Jan 23, 2016 at 12:19 PM, Sree Eedupuganti  wrote:

> New to Spark Streaming. My question is i want to load the XML files to
> database [cassandra] using spark streaming.Any suggestions please.Thanks in
> Advance.
>
> --
> Best Regards,
> Sreeharsha Eedupuganti
> Data Engineer
> innData Analytics Private Limited
>


Re: Debug what is replication Level of which RDD

2016-01-27 Thread Akhil Das
How many RDDs are you persisting? If it's 2, then you can verify it by
disabling the persist for one of them; from the UI you can then see which one
is the mappedRDD/shuffledRDD.
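
Another option that avoids guessing: give each RDD a name before persisting, and the Storage tab will show that name next to its replication level. A small sketch (the RDDs and names are just examples):

import org.apache.spark.storage.StorageLevel

// Example RDDs; substitute your own. The names show up on the Storage tab.
val events  = sc.parallelize(1 to 1000).setName("events").persist(StorageLevel.MEMORY_ONLY_2)  // 2x replicated
val lookups = sc.parallelize(1 to 100).setName("lookups").persist(StorageLevel.MEMORY_ONLY)    // 1x
events.count(); lookups.count()  // force materialization so they appear in the UI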

Thanks
Best Regards

On Sun, Jan 24, 2016 at 3:25 AM, gaurav sharma 
wrote:

> Hi All,
>
> I have enabled replication for my RDDs.
>
> I see on the Storage tab of the Spark UI, which mentions the replication
> level 2x or 1x.
>
> But the names given are mappedRDD, shuffledRDD, I am not able to debug
> which of my RDD is 2n replicated, and which one is 1x.
>
> Please help.
>
> Regards,
> Gaurab
>


Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Akhil Das
Can you look in the executor logs and see why the sparkcontext is being
shutdown? Similar discussion happened here previously.
http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html

Thanks
Best Regards

On Thu, Jan 21, 2016 at 5:11 PM, Soni spark 
wrote:

> Hi Friends,
>
> My spark job is running successfully in local mode but failing in cluster
> mode. Below is the error message I am getting. Can anyone help me?
>
>
>
> 16/01/21 16:38:07 INFO twitter4j.TwitterStreamImpl: Establishing connection.
> 16/01/21 16:38:07 INFO twitter.TwitterReceiver: Twitter receiver started
> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Called receiver 
> onStart
> 16/01/21 16:38:07 INFO receiver.ReceiverSupervisorImpl: Waiting for receiver 
> to be stopped*16/01/21 16:38:10 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 
> 15: SIGTERM*
> 16/01/21 16:38:10 INFO streaming.StreamingContext: Invoking 
> stop(stopGracefully=false) from shutdown hook
> 16/01/21 16:38:10 INFO scheduler.ReceiverTracker: Sent stop signal to all 1 
> receivers
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Received stop signal
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
> with message: Stopped by driver:
> 16/01/21 16:38:10 INFO twitter.TwitterReceiver: Twitter receiver stopped
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Deregistering 
> receiver 0*16/01/21 16:38:10 ERROR scheduler.ReceiverTracker: Deregistered 
> receiver for stream 0: Stopped by driver*
> 16/01/21 16:38:10 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
> 16/01/21 16:38:10 INFO receiver.BlockGenerator: Stopping BlockGenerator
> 16/01/21 16:38:10 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ...
>
> Thanks
>
> Soniya
>
>


Re: Parquet write optimization by row group size config

2016-01-20 Thread Akhil Das
It would be good if you could share the code; someone here or I can guide you
better once you post the code snippet.

Thanks
Best Regards

On Wed, Jan 20, 2016 at 10:54 PM, Pavel Plotnikov <
pavel.plotni...@team.wrike.com> wrote:

> Thanks, Akhil! It helps, but this job is still not fast enough; maybe I
> missed something
>
> Regards,
> Pavel
>
> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
>> pavel.plotni...@team.wrike.com> wrote:
>>
>>> Hello,
>>> I'm using Spark on some machines in standalone mode; data storage is
>>> mounted on these machines via NFS. I have an input data stream, and when I'm
>>> trying to store all the data for an hour in Parquet, the job executes mostly on one
>>> core and this hourly data takes 40-50 minutes to store. It is very slow!
>>> And it is not an IO problem. After researching how the Parquet format works, I found
>>> that it can be parallelized at the row-group abstraction level.
>>> I think the row group for my files is too large; how can I change it?
>>> When I create a big DataFrame it divides into parts very well and writes
>>> quickly!
>>>
>>> Thanks,
>>> Pavel
>>>
>>
>>


Re: Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Akhil Das
You can use sc.newAPIHadoopFile and pass your own InputFormat and
RecordReader, which will read the compressed .gz files according to your use case; a
simpler alternative is sketched after the list below. For a start, you can look at the:

- wholeTextFile implementation

- WholeTextFileInputFormat

- WholeTextFileRecordReader
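
As a simpler alternative that avoids a custom InputFormat, a rough sketch that lists the .gz files (sc.textFile handles the gzip codec), prepends the date taken from each file name, and unions the results; the directory is a placeholder:

import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("hdfs:///data/pagecounts")
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listStatus(dir).map(_.getPath.toString).filter(_.endsWith(".gz"))

// e.g. ".../pagecounts-20090501-00.gz" -> "20090501"
def dateOf(path: String): String = path.split("-")(1)

// One small RDD per file with its date prepended, then one union over all of them.
val perFile = files.map { f =>
  val date = dateOf(f)
  sc.textFile(f).map(line => date + " " + line)
}
val all = sc.union(perFile)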






Thanks
Best Regards

On Tue, Jan 19, 2016 at 11:48 PM, Femi Anthony  wrote:

>
>
>  I  have a set of log files I would like to read into an RDD. These files
> are all compressed .gz and are the filenames are date stamped. The source
> of these files is the page view statistics data for wikipedia
>
> http://dumps.wikimedia.org/other/pagecounts-raw/
>
> The file names look like this:
>
> pagecounts-20090501-00.gz
> pagecounts-20090501-01.gz
> pagecounts-20090501-02.gz
>
> What I would like to do is read in all such files in a directory and
> prepend the date from the filename (e.g. 20090501) to each row of the
> resulting RDD. I first thought of using *sc.wholeTextFiles(..)* instead of
>  *sc.textFile(..)*, which creates a PairRDD with the key being the file
> name with a path, but *sc.wholeTextFiles()* doesn't handle compressed .gz
> files.
>
> Any suggestions would be welcome.
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>


Re: process of executing a program in a distributed environment without hadoop

2016-01-19 Thread Akhil Das
If you are processing a file, then you can keep the same file in all
machines in the same location and everything should work.
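
A tiny sketch of that, assuming the file has been copied to the same local path on every node:

// file:// reads from the local filesystem of each node instead of HDFS;
// the same path must exist on every machine in the cluster.
val lines = sc.textFile("file:///data/shared/input.txt")
println(lines.count())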

Thanks
Best Regards

On Wed, Jan 20, 2016 at 11:15 AM, Kamaruddin  wrote:

> I want to execute a program in a distributed environment without using
> hadoop
> and only in spark cluster. What is the best way to use it?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/process-of-executing-a-program-in-a-distributed-environment-without-hadoop-tp26015.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Parquet write optimization by row group size config

2016-01-19 Thread Akhil Das
Did you try re-partitioning the data before doing the write?
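
A rough sketch of that, together with the Parquet row-group knob mentioned in the question; `df` stands for the hourly DataFrame, the size and path are only examples, and parquet.block.size is the Parquet-Hadoop name for the row-group size in bytes:

// Smaller row groups give more parallelism when the files are read back.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)  // 64 MB

// Spread the hourly data over more partitions so the write uses more than one core.
df.repartition(16).write.parquet("hdfs:///data/hourly/2016-01-19-18")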

Thanks
Best Regards

On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
pavel.plotni...@team.wrike.com> wrote:

> Hello,
> I'm using Spark on some machines in standalone mode; data storage is
> mounted on these machines via NFS. I have an input data stream, and when I'm
> trying to store all the data for an hour in Parquet, the job executes mostly on one
> core and this hourly data takes 40-50 minutes to store. It is very slow!
> And it is not an IO problem. After researching how the Parquet format works, I found
> that it can be parallelized at the row-group abstraction level.
> I think the row group for my files is too large; how can I change it?
> When I create a big DataFrame it divides into parts very well and writes
> quickly!
>
> Thanks,
> Pavel
>


Re: Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-24 Thread Akhil Das
Would you mind posting the relevant code snippet?

Thanks
Best Regards

On Wed, Dec 23, 2015 at 7:33 PM, Vyacheslav Yanuk 
wrote:

> Hi.
> I have very strange situation with direct reading from Kafka.
> For example.
> I have 1000 messages in Kafka.
> After submitting my application I read this data and process it.
> As I process the data I have accumulated 10 new entries.
> In next reading from Kafka I read only 3 records, but not 10!!!
> Why???
> I don't understand...
> Explain to me please!
>
> --
> WBR, Vyacheslav Yanuk
> Codeminders.com
>


Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
Both are similar, give both a go and choose the one you like.
On Dec 23, 2015 7:55 PM, "Eran Witkon" <eranwit...@gmail.com> wrote:

> Thanks, so based on that article, should I use sbt or maven? Or either?
> Eran
> On Wed, 23 Dec 2015 at 13:05 Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> You will have to point to your spark-assembly.jar since spark has a lot
>> of dependencies. You can read the answers discussed over here to have a
>> better understanding
>> http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon <eranwit...@gmail.com>
>> wrote:
>>
>>> Thanks, all of these examples show how to link to the spark source and
>>> build it as part of my project. Why should I do that? Why not point
>>> directly to my spark.jar?
>>> Am I missing something?
>>> Eran
>>>
>>> On Wed, Dec 23, 2015 at 9:59 AM Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> 1. Install sbt plugin on IntelliJ
>>>> 2. Create a new project/Import an sbt project like Dean suggested
>>>> 3. Happy Debugging.
>>>>
>>>> You can also refer to this article for more information
>>>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon <eranwit...@gmail.com>
>>>> wrote:
>>>>
>>>>> Any pointers how to use InteliJ for spark development?
>>>>> Any way to use scala worksheet run like spark- shell?
>>>>>
>>>>
>>>>
>>


Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
You will have to point to your spark-assembly.jar since spark has a lot of
dependencies. You can read the answers discussed over here to have a better
understanding
http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits

Thanks
Best Regards

On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon <eranwit...@gmail.com> wrote:

> Thanks, all of these examples show how to link to the spark source and build
> it as part of my project. Why should I do that? Why not point directly to
> my spark.jar?
> Am I missing something?
> Eran
>
> On Wed, Dec 23, 2015 at 9:59 AM Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> 1. Install sbt plugin on IntelliJ
>> 2. Create a new project/Import an sbt project like Dean suggested
>> 3. Happy Debugging.
>>
>> You can also refer to this article for more information
>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon <eranwit...@gmail.com>
>> wrote:
>>
>>> Any pointers how to use InteliJ for spark development?
>>> Any way to use scala worksheet run like spark- shell?
>>>
>>
>>


Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
1. Install sbt plugin on IntelliJ
2. Create a new project/Import an sbt project like Dean suggested
3. Happy Debugging.

You can also refer to this article for more information
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ

Thanks
Best Regards

On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon  wrote:

> Any pointers how to use InteliJ for spark development?
> Any way to use scala worksheet run like spark- shell?
>


Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread Akhil Das
You need to:

1. Make sure your local router has NAT enabled and has port-forwarded the
networking ports listed here.
2. Make sure that on your cluster, port 7077 is accessible from your local (public)
IP address. You can try telnet 10.20.17.70 7077.
3. Set spark.driver.host so that the cluster can connect back to your
machine.
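
A sketch of (3) on the SparkConf, using the addresses from your mail; the ports are just examples and must be reachable/forwarded from the cluster back to the Windows box:

val conf = new org.apache.spark.SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://10.20.17.70:7077")
  // the address the executors use to reach the driver running on the Windows box
  .set("spark.driver.host", "10.20.6.23")
  // pin the driver-side ports so they can be opened or forwarded on the router
  .set("spark.driver.port", "51000")
  .set("spark.blockManager.port", "51100")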



Thanks
Best Regards

On Wed, Dec 23, 2015 at 10:02 AM, superbee84  wrote:

> Hi All,
>
>I'm new to Spark. Before I describe the problem, I'd like to let you
> know
> the role of the machines that organize the cluster and the purpose of my
> work. By reading and follwing the instructions and tutorials, I
> successfully
> built up a cluster with 7 CentOS-6.5 machines. I installed Hadoop 2.7.1,
> Spark 1.5.1, Scala 2.10.4 and ZooKeeper 3.4.5 on them. The details are
> listed as below:
>
>
> Host Name  |  IP Address  |  Hadoop 2.7.1 | Spark 1.5.1|
> ZooKeeper
> hadoop00   | 10.20.17.70  | NameNode(Active)   | Master(Active)   |   none
> hadoop01   | 10.20.17.71  | NameNode(Standby)| Master(Standby) |   none
> hadoop02   | 10.20.17.72  | ResourceManager(Active)| none  |   none
> hadoop03   | 10.20.17.73  | ResourceManager(Standby)| none|  none
> hadoop04   | 10.20.17.74  | DataNode  |  Worker  |
> JournalNode
> hadoop05   | 10.20.17.75  | DataNode  |  Worker  |
> JournalNode
> hadoop06   | 10.20.17.76  | DataNode  |  Worker  |
> JournalNode
>
>Now my *purpose* is to develop Hadoop/Spark applications on my own
> computer(IP: 10.20.6.23) and submit them to the remote cluster. As all the
> other guys in our group are in the habit of eclipse on Windows, I'm trying
> to work on this. I have successfully submitted the WordCount MapReduce job
> to YARN and it run smoothly through eclipse and Windows. But when I tried
> to
> run the Spark WordCount, it gives me the following error in the eclipse
> console:
>
> 15/12/23 11:15:30 INFO AppClient$ClientEndpoint: Connecting to master
> spark://10.20.17.70:7077...
> 15/12/23 11:15:50 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in
> thread Thread[appclient-registration-retry-thread,5,main]
> java.util.concurrent.RejectedExecutionException: Task
> java.util.concurrent.FutureTask@29ed85e7 rejected from
> java.util.concurrent.ThreadPoolExecutor@28f21632[Running, pool size = 1,
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at
>
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
> at java.util.concurrent.AbstractExecutorService.submit(Unknown
> Source)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown
> Source)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Unknown Source)
> 15/12/23 11:15:50 INFO DiskBlockManager: Shutdown hook called
> 15/12/23 11:15:50 INFO ShutdownHookManager: Shutdown hook called
>
> Then 

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to
work. Try reinstalling on all the machines:

sudo apt-get purge python-numpy
sudo pip uninstall numpy
sudo pip install numpy


Thanks
Best Regards

On Sun, Dec 20, 2015 at 11:19 PM, Abhinav M Kulkarni <
abhinavkulka...@gmail.com> wrote:

> I am running Spark programs on a large cluster (for which, I do not have
> administrative privileges). numpy is not installed on the worker nodes.
> Hence, I bundled numpy with my program, but I get the following error:
>
> Traceback (most recent call last):
>   File "/home/user/spark-script.py", line 12, in 
> import numpy
>   File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line
> 170, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line
> 13, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py",
> line 8, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py",
> line 11, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py",
> line 6, in 
> ImportError: cannot import name multiarray
>
> The script is actually quite simple:
>
> from pyspark import SparkConf, SparkContext
> sc = SparkContext()
>
> sc.addPyFile('numpy.zip')
>
> import numpy
>
> a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
> print a.collect()
>
> I understand that the error occurs because numpy dynamically loads
> multiarray.so dependency and even if my numpy.zip file includes
> multiarray.so file, somehow the dynamic loading doesn't work with Apache
> Spark. Why so? And how do you otherwise create a standalone numpy module
> with static linking?
>
> P.S. The numpy.zip file I had included with the program was zipped version
> of the numpy installation on my Ubuntu machine. I also tried downloading
> numpy source and building it on my local machine and bundling it with the
> program, but the problem persisted. My local machine and the worker nodes
> both run Ubuntu 64.
>
> Thanks.
>


Re: I coded an example to use Twitter stream as a data source for Spark

2015-12-22 Thread Akhil Das
Why not create a custom dstream and generate the data from there itself,
instead of Spark connecting to a socket server which will be fed by another
twitter client?
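
A minimal sketch of a receiver-based custom DStream in Scala (the class name
and the placeholder fetch are made up; plug in whatever twitter client you
already use where the comment says so):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: replace the placeholder with calls to your own
// twitter client and push each status into Spark with store().
class TweetReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    new Thread("tweet-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("one tweet as a string") // fetch from your twitter client here
          Thread.sleep(100)
        }
      }
    }.start()
  }
  def onStop(): Unit = {} // the loop above exits once isStopped() returns true
}

val conf = new SparkConf().setAppName("custom-dstream")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = ssc.receiverStream(new TweetReceiver)
tweets.print()
ssc.start()
ssc.awaitTermination()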

Thanks
Best Regards

On Sat, Dec 19, 2015 at 5:47 PM, Amir Rahnama  wrote:

> Hi guys,
>
> Thought someone would need this:
>
> https://github.com/ambodi/realtime-spark-twitter-stream-mining
>
> you can use this approach to feed twitter stream to your spark job. So
> far, PySpark does not have a twitter dstream source.
>
>
>
> --
> Thanks and Regards,
>
> Amir Hossein Rahnama
>
> *Tel: +46 (0) 761 681 102*
> Website: www.ambodi.com
> Twitter: @_ambodi 
>


Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to
work. Try reinstalling on all the machines:

sudo apt-get purge python-numpy
sudo pip uninstall numpy
sudo pip install numpy



Thanks
Best Regards

On Sun, Dec 20, 2015 at 8:33 AM, Abhinav M Kulkarni <
abhinavkulka...@gmail.com> wrote:

> I am running Spark programs on a large cluster (for which, I do not have
> administrative privileges). numpy is not installed on the worker nodes.
> Hence, I bundled numpy with my program, but I get the following error:
>
> Traceback (most recent call last):
>   File "/home/user/spark-script.py", line 12, in 
> import numpy
>   File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line
> 170, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line
> 13, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py",
> line 8, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py",
> line 11, in 
>   File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py",
> line 6, in 
> ImportError: cannot import name multiarray
>
> The script is actually quite simple:
>
> from pyspark import SparkConf, SparkContext
> sc = SparkContext()
>
> sc.addPyFile('numpy.zip')
>
> import numpy
>
> a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
> print a.collect()
>
> I understand that the error occurs because numpy dynamically loads
> multiarray.so dependency and even if my numpy.zip file includes
> multiarray.so file, somehow the dynamic loading doesn't work with Apache
> Spark. Why so? And how do you otherwise create a standalone numpy module
> with static linking?
>
> P.S. The numpy.zip file I had included with the program was zipped version
> of the numpy installation on my Ubuntu machine. I also tried downloading
> numpy source and building it and bundling it with the program, but the
> problem persisted.
>
> Thanks.
>
>


Re: Memory allocation for Broadcast values

2015-12-22 Thread Akhil Das
If you are creating a huge map on the driver, then spark.driver.memory
should be set to a higher value to hold your map. Since you are going to
broadcast this map, your Spark executors must have enough memory to hold
this map as well, which can be set using the spark.executor.memory and
spark.storage.memoryFraction configurations.
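
For illustration only (the sizes below are placeholders; tune them to the
actual size of your map):

import org.apache.spark.SparkConf

// Executor-side settings can go on the SparkConf.
val conf = new SparkConf()
  .setAppName("broadcast-heavy-job")
  .set("spark.executor.memory", "8g")          // each executor holds one full copy of the broadcast
  .set("spark.storage.memoryFraction", "0.6")  // share of executor memory for cached/broadcast blocks (Spark 1.x)
// spark.driver.memory has to be in place before the driver JVM starts, so set
// it via spark-submit --driver-memory 8g or in spark-defaults.conf instead.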

Thanks
Best Regards

On Mon, Dec 21, 2015 at 5:50 AM, Pat Ferrel  wrote:

> I have a large Map that is assembled in the driver and broadcast to each
> node.
>
> My question is how best to allocate memory for this.  The Driver has to
> have enough memory for the Maps, but only one copy is serialized to each
> node. What type of memory should I size to match the Maps? Is the broadcast
> Map taking a little from each executor, all from every executor, or is
> there something other than driver and executor memory I can size?
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: configure spark for hive context

2015-12-22 Thread Akhil Das
Looks like Spark failed to parse one of the values in the configuration file
you placed in the conf directory, and that crashed the HiveContext
initialization.

Thanks
Best Regards

On Mon, Dec 21, 2015 at 3:35 PM, Divya Gehlot 
wrote:

> Hi,
> I am trying to configure spark for hive context  (Please dont get mistaken
> with hive on spark )
> I placed hive-site.xml in spark/CONF_DIR
> Now when I run spark-shell I am getting below error
> Version which I am using
>
>
>
>
> *Hadoop 2.6.2  Spark 1.5.2   Hive 1.2.1 *
>
>
> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>>   /_/
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.8.0_66)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> Spark context available as sc.
>> java.lang.RuntimeException: java.lang.IllegalArgumentException:
>> java.net.URISyntaxException: Relative path in absolute URI:
>> ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
>> at
>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>> at
>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>> at
>> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>> at
>> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>> at
>> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
>> at $iwC$$iwC.(:9)
>> at $iwC.(:18)
>> at (:20)
>> at .(:24)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>> at
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>> at
>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>> at
>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>> at
>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>> at
>> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
>> at
>> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
>> at
>> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
>> at
>> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
>> at
>> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
>> at
>> org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)
>> at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
>> at
>> org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)
>> at
>> org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>> at
>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>> at org.apache.spark.repl.SparkILoop.org
>> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>> at org.apache.spark.repl.Main.main(Main.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> 

Re: hive on spark

2015-12-21 Thread Akhil Das
Looks like a version mismatch; you will need to investigate further and make
sure the versions are compatible.

Thanks
Best Regards

On Sat, Dec 19, 2015 at 2:15 AM, Ophir Etzion  wrote:

> During spark-submit when running hive on spark I get:
>
> Exception in thread "main" java.util.ServiceConfigurationError: 
> org.apache.hadoop.fs.FileSystem: Provider 
> org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated
>
>
> Caused by: java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.fs.DelegationTokenRenewer.(Ljava/lang/Class;)V from 
> class org.apache.hadoop.hdfs.HftpFileSystem
>
> I managed to make hive on spark work on a staging cluster I have and now
> I'm trying to do the same on a production cluster and this happened. Both
> are cdh5.4.3.
>
> I read that this is due to something not being compiled against the correct 
> hadoop version.
> my main question what is the binary/jar/file that can cause this?
>
> I tried replacing the binaries and jars to the ones used by the staging 
> cluster (that hive on spark worked on) and it didn't help.
>
> Thank you for anyone reading this, and thank you for any direction on where 
> to look.
>
> Ophir
>
>


Re: ​Spark 1.6 - YARN Cluster Mode

2015-12-21 Thread Akhil Das
Try adding these properties:

spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.2.0-2950

​There was a similar discussion with Spark 1.3.0 over here
http://stackoverflow.com/questions/29470542/spark-1-3-0-running-pi-example-on-yarn-fails
​


Thanks
Best Regards

On Fri, Dec 18, 2015 at 1:33 AM, syepes  wrote:

> Hello,
>
> This week I have been testing 1.6 (#d509194b) in our HDP 2.3 platform and
> its been working pretty ok, at the exception of the YARN cluster deployment
> mode.
> Note that with 1.5 using the same "spark-props.conf" and "spark-env.sh"
> config files the cluster mode works as expected.
>
> Has anyone else also tried the cluster mode in 1.6?
>
>
> Problem reproduction:
> 
> # spark-submit --master yarn --deploy-mode cluster --num-executors 1
> --properties-file $PWD/spark-props.conf --class
> org.apache.spark.examples.SparkPi
> /opt/spark/lib/spark-examples-1.6.0-SNAPSHOT-hadoop2.7.1.jar
>
> Error: Could not find or load main class
> org.apache.spark.deploy.yarn.ApplicationMaster
>
> spark-props.conf
> -
> spark.driver.extraJavaOptions  -Dhdp.version=2.3.2.0-2950
> spark.driver.extraLibraryPath
>
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.executor.extraJavaOptions  -Dhdp.version=2.3.2.0-2950
> spark.executor.extraLibraryPath
>
> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> -
>
> I will try to do some more debugging on this issue.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-YARN-Cluster-Mode-tp25729.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Error on using updateStateByKey

2015-12-21 Thread Akhil Das
You can do it like this:


private static Function2<List<Long>, Optional<Long>, Optional<Long>>
  UPDATEFUNCTION =
  new Function2<List<Long>, Optional<Long>, Optional<Long>>() {
    @Override
    public Optional<Long> call(List<Long> nums, Optional<Long> current)
        throws Exception {
      long sum = current.or(0L);
      for (long i : nums) {
        sum += i;
      }
      return Optional.of(sum);
    }
  };

...

// Key type assumed to be String here for illustration.
JavaPairDStream<String, Long> newStream =
    myPairStream.updateStateByKey(UPDATEFUNCTION);


Have a look at this codebase if you need a working example.

Thanks
Best Regards

On Fri, Dec 18, 2015 at 1:42 PM, Abhishek Anand 
wrote:

> I am trying to use updateStateByKey but receiving the following error.
> (Spark Version 1.4.0)
>
> Can someone please point out what might be the possible reason for this
> error.
>
>
> *The method
> updateStateByKey(Function2)
> in the type JavaPairDStream is not applicable
> for the arguments *
>
> * 
> (Function2)*
>
>
> This is the update function that I am using inside updateStateByKey.
>
> I am applying updateStateByKey on a tuple of 
>
> private static Function2 Optional, Optional> updateFunction =
> new Function2 Optional>() {
> /**
> *
> */
> private static final long serialVersionUID = 1L;
>
> @Override
> public Optional call(List values,
> Optional current) {
> AggregationMetrics newSum = current.or(new AggregationMetrics(0L, 0L, 0L));
> for(int i=0; i < values.size(); i++)
> {
> //set with new values
> }
> return Optional.of(newSum);
> }
> };
>
>
>
> Thanks,
> Abhi
>
>


Re: One task hangs and never finishes

2015-12-21 Thread Akhil Das
Pasting the relevant code might help to understand better what exactly you
are doing.

Thanks
Best Regards

On Thu, Dec 17, 2015 at 9:25 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
>
> I have an application running a set of transformations and finishes with 
> saveAsTextFile.
>
> Out of 80 tasks all finish pretty fast but one that just hangs and outputs 
> these message to STDERR:
>
> 5/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82 spilling 
> in-memory map of 4.0 GB to disk (6 times so far)
>
> 15/12/17 17:23:41 INFO collection.ExternalAppendOnlyMap: Thread 82 spilling 
> in-memory map of 3.8 GB to disk (7 times so far)
>
>
> Inside the WEBUI I can see that for some reason the shuffle spill memory is 
> extremely high (15 GB) compared to the others (around a few MB to 1 GB) and as 
> a result the GC time is extremely bad
>
>
>
> [Flattened task table from the Web UI: one RUNNING task, PROCESS_LOCAL, on
> executor 8 / impact3.indigo.co.il, launched 2015/12/17 17:18:32, duration 32
> min, GC time 25 min, shuffle read 835.8 MB / 5217833 records, shuffle spill
> 15.2 GB (memory) / 662.9 MB (disk)]
>
> I'm running with 8 executors with 8 cpus and 25GB ram each and it seems that 
> tasks are correctly spread across the nodes:
>
>
>
> [Flattened executors table from the Web UI: 8 executors across
> impact1-impact4.indigo.co.il, each showing 0.0 B / 12.9 GB storage memory
> used, 24-28 completed tasks, roughly 3.3-4.3 h task time, ~2 GB input,
> shuffle read between ~94 MB and ~532 MB, and shuffle write roughly 496-596 MB
> per executor, with stdout/stderr/thread-dump links]

Re: Using Spark to process JSON with gzip filed

2015-12-20 Thread Akhil Das
Yes it is. You can actually use the java.util.zip.GZIPInputStream in your
case.
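
A rough Scala sketch, assuming you have already parsed the outer JSON and
pulled the compressed bytes into an RDD[Array[Byte]] (the `payloads` name is
made up):

import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// payloads: RDD[Array[Byte]] holding the gzipped inner JSON of each record
val decompressed = payloads.map { bytes =>
  val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
  try Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()
}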

Thanks
Best Regards

On Sun, Dec 20, 2015 at 3:23 AM, Eran Witkon <eranwit...@gmail.com> wrote:

> Thanks, since it is just a snippt do you mean that Inflater is coming
> from ZLIB?
> Eran
>
> On Fri, Dec 18, 2015 at 11:37 AM Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Something like this? This one uses the ZLIB compression, you can replace
>> the decompression logic with GZip one in your case.
>>
>> compressedStream.map(x => {
>>   val inflater = new Inflater()
>>   inflater.setInput(x.getPayload)
>>   val decompressedData = new Array[Byte](x.getPayload.size * 2)
>>   var count = inflater.inflate(decompressedData)
>>   var finalData = decompressedData.take(count)
>>   while (count > 0) {
>> count = inflater.inflate(decompressedData)
>> finalData = finalData ++ decompressedData.take(count)
>>   }
>>   new String(finalData)
>> })
>>
>>
>>
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Dec 16, 2015 at 10:02 PM, Eran Witkon <eranwit...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I have a few JSON files in which one of the field is a binary filed -
>>> this field is the output of running GZIP of a JSON stream and compressing
>>> it to the binary field.
>>>
>>> Now I want to de-compress the field and get the outpur JSON.
>>> I was thinking of running map operation and passing a function to the
>>> map operation which will decompress each JSON file.
>>> the above function will find the right field in the outer JSON and then
>>> run GUNZIP on it.
>>>
>>> 1) is this a valid practice for spark map job?
>>> 2) any pointer on how to do that?
>>>
>>> Eran
>>>
>>
>>


Re: Saving to JDBC

2015-12-18 Thread Akhil Das
You will have to properly order the columns before writing or you can
change the column order in the actual table according to your job.
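
For example (a Scala sketch with made-up column names and JDBC URL, where df
stands for the DataFrame built from your RDD; the same df.select trick applies
from PySpark):

import java.util.Properties

// Select the columns in the same order as the target table before writing.
val ordered = df.select("name", "value")
ordered.write.jdbc("jdbc:postgresql://dbhost:5432/mydb", "tablename", new Properties())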

Thanks
Best Regards

On Tue, Dec 15, 2015 at 1:47 AM, Bob Corsaro  wrote:

> Is there anyway to map pyspark.sql.Row columns to JDBC table columns, or
> do I have to just put them in the right order before saving?
>
> I'm using code like this:
>
> ```
> rdd = rdd.map(lambda i: Row(name=i.name, value=i.value))
> sqlCtx.createDataFrame(rdd).write.jdbc(dbconn_string, tablename,
> mode='append')
> ```
>
> Since the Row class orders them alphabetically, they are inserted into the
> sql table in alphabetical order instead of matching Row columns to table
> columns.
>


Re: UNSUBSCRIBE

2015-12-18 Thread Akhil Das
Send the mail to user-unsubscr...@spark.apache.org read more over here
http://spark.apache.org/community.html

Thanks
Best Regards

On Tue, Dec 15, 2015 at 3:39 AM, Mithila Joshi 
wrote:

> unsubscribe
>
> On Mon, Dec 14, 2015 at 4:49 PM, Tim Barthram 
> wrote:
>
>> UNSUBSCRIBE Thanks
>>
>>
>>
>> _
>>
>> The information transmitted in this message and its attachments (if any)
>> is intended
>> only for the person or entity to which it is addressed.
>> The message may contain confidential and/or privileged material. Any
>> review,
>> retransmission, dissemination or other use of, or taking of any action in
>> reliance
>> upon this information, by persons or entities other than the intended
>> recipient is
>> prohibited.
>>
>> If you have received this in error, please contact the sender and delete
>> this e-mail
>> and associated material from any computer.
>>
>> The intended recipient of this e-mail may only use, reproduce, disclose
>> or distribute
>> the information contained in this e-mail and any attached files, with the
>> permission
>> of the sender.
>>
>> This message has been scanned for viruses.
>> _
>>
>
>


Re: How to do map join in Spark SQL

2015-12-18 Thread Akhil Das
You can broadcast your json data and then do a map side join. This article
is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
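
A sketch of how that can look in Scala, assuming a sqlContext is in scope and
that the ORC data has a `duration` column while the small JSON file has
`min`, `max` and `name` (the paths are placeholders):

import org.apache.spark.sql.functions.broadcast

val big    = sqlContext.read.format("orc").load("/path/to/orc/folder")
val ranges = sqlContext.read.json("/path/to/ranges.json")

// broadcast() (available since Spark 1.5) hints the planner to ship the small
// side to every executor, so this non-equi join runs map-side instead of as a
// plain cross join.
val joined = big.join(broadcast(ranges),
  big("duration") >= ranges("min") && big("duration") < ranges("max"))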

Thanks
Best Regards

On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov 
wrote:

> I have big folder having ORC files. Files have duration field (e.g.
> 3,12,26, etc)
> Also I have small json file  (just 8 rows) with ranges definition (min,
> max , name)
> 0, 10, A
> 10, 20, B
> 20, 30, C
> etc
>
> Because I can not do equi-join btw duration and range min/max I need to do
> cross join and apply WHERE condition to take records which belong to the
> range
> Cross join is an expensive operation I think that it's better if this
> particular join done using Map Join
>
> How to do Map join in Spark Sql?
>


Re: Using Spark to process JSON with gzip filed

2015-12-18 Thread Akhil Das
Something like this? This one uses the ZLIB compression, you can replace
the decompression logic with GZip one in your case.

compressedStream.map(x => {
  val inflater = new Inflater()
  inflater.setInput(x.getPayload)
  val decompressedData = new Array[Byte](x.getPayload.size * 2)
  var count = inflater.inflate(decompressedData)
  var finalData = decompressedData.take(count)
  while (count > 0) {
count = inflater.inflate(decompressedData)
finalData = finalData ++ decompressedData.take(count)
  }
  new String(finalData)
})




Thanks
Best Regards

On Wed, Dec 16, 2015 at 10:02 PM, Eran Witkon  wrote:

> Hi,
> I have a few JSON files in which one of the field is a binary filed - this
> field is the output of running GZIP of a JSON stream and compressing it to
> the binary field.
>
> Now I want to de-compress the field and get the outpur JSON.
> I was thinking of running map operation and passing a function to the map
> operation which will decompress each JSON file.
> the above function will find the right field in the outer JSON and then
> run GUNZIP on it.
>
> 1) is this a valid practice for spark map job?
> 2) any pointer on how to do that?
>
> Eran
>


Re: Unable to get json for application jobs in spark 1.5.0

2015-12-18 Thread Akhil Das
Which version of spark are you using? You can test this by opening up a
spark-shell, firing a simple job (sc.parallelize(1 to 100).collect()) and
then accessing the
http://sigmoid-driver:4040/api/v1/applications/Spark%20shell/jobs

[image: Inline image 1]

Thanks
Best Regards

On Tue, Dec 15, 2015 at 4:46 PM, rakesh rakshit  wrote:

> Hi all,
>
> I am trying to get the json for all the jobs running within my application
> whose UI port is 4040.
>
> I am making an HTTP GET request at the following URI:
>
> http://172.26.32.143:4040/api/v1/applications/gauravpipe/jobs
>
> But getting the following service unavailable exception:
>
> 
> 
> 
> Error 503 Service Unavailable
> 
> HTTP ERROR 503
> Problem accessing /api/v1/applications/gauravpipe/jobs. Reason:
> Service UnavailableCaused 
> by:org.spark-project.jetty.servlet.ServletHolder$1: 
> java.lang.reflect.InvocationTargetException
>   at 
> org.spark-project.jetty.servlet.ServletHolder.makeUnavailable(ServletHolder.java:496)
>   at 
> org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:543)
>   at 
> org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
>   at 
> org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
>   at 
> org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
>   at 
> org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
>   at 
> org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
>   at 
> org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
>   at 
> org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264)
>   at 
> org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.spark-project.jetty.server.Server.handle(Server.java:370)
>   at 
> org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
>   at 
> org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
>   at 
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
>   at 
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>   at 
> org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>   at 
> org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>   at 
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
>   at javax.servlet.GenericServlet.init(GenericServlet.java:241)
>   at 
> org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
>   ... 22 more
> Caused by: java.lang.IncompatibleClassChangeError: Implementing class
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at 

Re: about spark on hbase

2015-12-18 Thread Akhil Das
*First you create the HBase configuration:*

  val hbaseTableName = "paid_daylevel"
  val hbaseColumnName = "paid_impression"
  val hconf = HBaseConfiguration.create()
  hconf.set("hbase.zookeeper.quorum", "sigmoid-dev-master")
  hconf.set("hbase.zookeeper.property.clientPort", "2182")
  hconf.set("hbase.defaults.for.version.skip", "true")
  hconf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTableName)
  hconf.setClass("mapreduce.job.outputformat.class",
classOf[TableOutputFormat[String]], classOf[OutputFormat[String, Mutation]])
  val admin = new HBaseAdmin(hconf)

*Then you read the values:*

val values = sparkContext.newAPIHadoopRDD(hconf, classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]).map {
  case (key, row) => {
val rowkey = Bytes.toString(key.get())
val valu = Bytes.toString(row.getValue(Bytes.toBytes("CF"),
Bytes.toBytes(hbaseColumnName)))

(rowkey, valu.toInt)
  }
}



*Then you modify or do whatever you want with the values using the rdd
transformations and then save the values:*

values.map(valu => (new ImmutableBytesWritable, {
  val record = new Put(Bytes.toBytes(valu._1))
  record.add(Bytes.toBytes("CF"), Bytes.toBytes(hbaseColumnName),
Bytes.toBytes(valu._2.toString))

  record
}
  )
).saveAsNewAPIHadoopDataset(hconf)



​You can also look at the
http://spark-packages.org/package/nerdammer/spark-hbase-connector ​




Thanks
Best Regards

On Tue, Dec 15, 2015 at 6:08 PM, censj  wrote:

> hi,all:
> how cloud I through spark function  hbase get value then update this
> value and put this value to hbase ?
>


Re: security testing on spark ?

2015-12-18 Thread Akhil Das
If port 7077 is open to the public on your cluster, that's all an attacker
needs to take over the cluster. You can read a bit about it here
https://www.sigmoid.com/securing-apache-spark-cluster/

You can also look at this small exploit I wrote
https://www.exploit-db.com/exploits/36562/

Thanks
Best Regards

On Wed, Dec 16, 2015 at 6:46 AM, Judy Nash 
wrote:

> Hi all,
>
>
>
> Does anyone know of any effort from the community on security testing
> spark clusters.
>
> I.e.
>
> Static source code analysis to find security flaws
>
> Penetration testing to identify ways to compromise spark cluster
>
> Fuzzing to crash spark
>
>
>
> Thanks,
>
> Judy
>
>
>


Re: spark master process shutdown for timeout

2015-12-18 Thread Akhil Das
Did you happen to have a look at this
https://issues.apache.org/jira/browse/SPARK-9629

Thanks
Best Regards

On Thu, Dec 17, 2015 at 12:02 PM, yaoxiaohua  wrote:

> Hi guys,
>
> I have two nodes used as spark master, spark1,spark2
>
> Spark1.4.0
>
> Jdk 1.7 sunjdk
>
>
>
> Now these days I found that spark2 master process may shutdown , I found
> that in log file:
>
> 15/12/17 13:09:58 INFO ClientCnxn: Client session timed out, have not
> heard from server in 40020ms for sessionid 0x351a889694144b1, closing
> socket connection and attempting reconnect
>
> 15/12/17 13:09:58 INFO ConnectionStateManager: State change: SUSPENDED
>
> 15/12/17 13:09:58 INFO ZooKeeperLeaderElectionAgent: We have lost
> leadership
>
> 15/12/17 13:09:58 ERROR Master: Leadership has been revoked -- master
> shutting down.
>
> 15/12/17 13:09:58 INFO Utils: Shutdown hook called
>
>
>
> It looks like timeout , I don’t know how to change the configure to avoid
> this case happened, please help me.
>
> Thanks.
>
>
>
> Best Regards,
>
> Evan Yao
>


Re: Unsubsribe

2015-12-14 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org to unsubscribe from the
list. See more over http://spark.apache.org/community.html

Thanks
Best Regards

2015-12-09 22:18 GMT+05:30 Michael Nolting :

> cancel
>
> --
> 
> *Michael Nolting*
> Head of Sevenval FDX
> 
> *Sevenval Technologies GmbH *
>
> FRONT-END-EXPERTS SINCE 1999
>
> Köpenicker Straße 154 | 10997 Berlin
>
> office   +49 30 707 190 - 278
> mail michael.nolt...@sevenval.com 
>
> www.sevenval.com
>
> Sitz: Köln, HRB 79823
> Geschäftsführung: Jan Webering (CEO), Thorsten May, Joern-Carlos Kuntze
>
> *Wir erhöhen den Return On Investment bei Ihren Mobile und Web-Projekten.
> Sprechen Sie uns an: *http://roi.sevenval.com/
>
> ---
> FOLLOW US on
>
> [image: Sevenval blog]
> 
>
> [image: sevenval on twitter]
> 
>  [image: sevenval on linkedin]
> [image:
> sevenval on pinterest]
> 
>
>
>


Re: Can't filter

2015-12-14 Thread Akhil Das
If you are not using Spark submit to run the job, then you need to add the
following line:

sc.addJar("target/scala_2.11/spark.jar")


After creating the SparkContext, where the spark.jar is your project jar.

Thanks
Best Regards

On Thu, Dec 10, 2015 at 5:29 PM, Бобров Виктор  wrote:

> Build.sbt
>
>
>
> name := "spark"
> version := "1.0"
> scalaVersion := "2.11.7"
>
> libraryDependencies ++= Seq(
>   "org.apache.spark" % "spark-core_2.11" % "1.5.1",
>   "org.apache.spark" % "spark-streaming_2.11" % "1.5.1",
>   "org.apache.spark" % "spark-mllib_2.11" % "1.5.1"
> )
>
>
>
>
>
> *From:* Бобров Виктор [mailto:ma...@bk.ru]
> *Sent:* Thursday, December 10, 2015 2:54 PM
> *To:* 'Harsh J' 
> *Cc:* user@spark.apache.org
> *Subject:* RE: Can't filter
>
>
>
> Spark – 1.5.1, ty for help.
>
>
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> import scala.io.Source
>
> object SimpleApp {
>   def main(args: Array[String]) {
>     var A = scala.collection.mutable.Map[Array[String], Int]()
>     val filename = "C:\\Users\\bobrov\\IdeaProjects\\spark\\file\\spark1.txt"
>     for ((line, i) <- Source.fromFile(filename).getLines().zipWithIndex) {
>       val lst = line.split(" ")
>       A += (lst -> i)
>     }
>
>     def filter1(tp: ((Array[String], Int), (Array[String], Int))): Boolean = {
>       tp._1._2 < tp._2._2
>     }
>
>     val conf = new SparkConf().setMaster("spark://web01:7077").setAppName("Simple Application")
>     val sc = new SparkContext(conf)
>     val mail_rdd = sc.parallelize(A.toSeq).cache()
>     val step1 = mail_rdd.cartesian(mail_rdd)
>     val step2 = step1.filter(filter1)
>     //step1.collect().foreach(println)
>   }
> }
>
>
>
> *From:* Harsh J [mailto:ha...@cloudera.com ]
> *Sent:* Thursday, December 10, 2015 2:50 PM
> *To:* Бобров Виктор ; Ndjido Ardo Bar 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Can't filter
>
>
>
> Are you sure you do not have any messages preceding the trace, such as one
> quoting which class is found to be missing? That'd be helpful to see and
> suggest what may (exactly) be going wrong. It appear similar to
> https://issues.apache.org/jira/browse/SPARK-8368, but I cannot tell for
> certain cause I don't know if your code uses the SparkSQL features.
>
>
>
> Also, what version is your Spark running?
>
>
>
> I am able to run your program without a problem in Spark 1.5.x (with a
> sample Seq).
>
>
>
> On Thu, Dec 10, 2015 at 5:01 PM Бобров Виктор  wrote:
>
> 0 = {StackTraceElement@7132}
> "com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.a(Unknown
> Source)"
>
> 1 = {StackTraceElement@7133}
> "com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown
> Source)"
>
> 2 = {StackTraceElement@7134}
> "org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:40)"
>
> 3 = {StackTraceElement@7135}
> "org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:81)"
>
> 4 = {StackTraceElement@7136}
> "org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:187)"
>
> 5 = {StackTraceElement@7137}
> "org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)"
>
> 6 = {StackTraceElement@7138}
> "org.apache.spark.SparkContext.clean(SparkContext.scala:2030)"
>
> 7 = {StackTraceElement@7139}
> "org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:331)"
>
> 8 = {StackTraceElement@7140}
> "org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:330)"
>
> 9 = {StackTraceElement@7141}
> "org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)"
>
> 10 = {StackTraceElement@7142}
> "org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)"
>
> 11 = {StackTraceElement@7143}
> "org.apache.spark.rdd.RDD.withScope(RDD.scala:306)"
>
> 12 = {StackTraceElement@7144}
> "org.apache.spark.rdd.RDD.filter(RDD.scala:330)"
>
> 13 = {StackTraceElement@7145}
> "SimpleApp$GeneratedEvaluatorClass$44$1.invoke(FileToCompile0.scala:30)"
>
> 14 = {StackTraceElement@7146} "SimpleApp$.main(test1.scala:26)"
>
> 15 = {StackTraceElement@7147} "SimpleApp.main(test1.scala)"
>
>
>
> *From:* Ndjido Ardo Bar [mailto:ndj...@gmail.com]
> *Sent:* Thursday, December 10, 2015 2:20 PM
> *To:* Бобров Виктор 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Can't filter
>
>
>
> Please send your call stack with the full description of the exception .
>
>
> On 10 Dec 2015, at 12:10, Бобров Виктор  wrote:
>
> Hi, I can’t filter my rdd.
>
>
>
> def filter1(tp: ((Array[String], Int), (Array[String], Int))): Boolean = {
>   tp._1._2 > tp._2._2
> }
> val mail_rdd = sc.parallelize(A.toSeq).cache()
> val 

Re: How to change StreamingContext batch duration after loading from checkpoint

2015-12-14 Thread Akhil Das
Taking the values from a configuration file rather than hard-coding them in
the code might help; I haven't tried it though.

Thanks
Best Regards

On Mon, Dec 7, 2015 at 9:53 PM, yam  wrote:

> Is there a way to change the streaming context batch interval after
> reloading
> from checkpoint?
>
> I would like to be able to change the batch interval after restarting the
> application without loosing the checkpoint of course.
>
> Thanks!
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-change-StreamingContext-batch-duration-after-loading-from-checkpoint-tp25624.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: HDFS

2015-12-14 Thread Akhil Das
Try to set the spark.locality.wait to a higher number and see if things
change. You can read more about the configuration properties from here
http://spark.apache.org/docs/latest/configuration.html#scheduling
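
For example (the value here is only an illustration):

import org.apache.spark.SparkConf

// Wait longer for a node-local slot before falling back to ANY.
val conf = new SparkConf().set("spark.locality.wait", "10s")
// or at submit time: spark-submit --conf spark.locality.wait=10s ...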

Thanks
Best Regards

On Sat, Dec 12, 2015 at 12:16 AM, shahid ashraf  wrote:

> hi Folks
>
> I am using standalone cluster of 50 servers on aws. i loaded data on hdfs,
>  why i am getting Locality Level as ANY for data on hdfs, i have 900+
> partitions.
>
>
> --
> with Regards
> Shahid Ashraf
>


Re: IP error on starting spark-shell on windows 7

2015-12-13 Thread Akhil Das
It's a warning, not an error. What happens when you don't specify
SPARK_LOCAL_IP at all? If it is able to bring up the spark shell, then try
netstat -np and see which address the driver is binding to.

Thanks
Best Regards

On Thu, Dec 10, 2015 at 9:49 AM, Stefan Karos  wrote:

> On starting spark-shell I see this just before the scala prompt:
>
> WARN : Your hostname, BloomBear-SSD resolves to a loopback/non-reachable
> address: fe80:0:0:0:0:5efe:c0a8:317%net10, but we couldn't find any
> external IP address!
>
> I get this error even when firewall is disabled.
> I also tried setting the environment variable SPARK_LOCAL_IP to various
> choices listed below:
>
> SPARK_LOCAL_IP=localhost
> SPARK_LOCAL_IP=127.0.0.1
> SPARK_LOCAL_IP=192.168.1.88   (my local machine's IPv4 address)
> SPARK_LOCAL_IP=fe80::eda5:a1a7:be1e:13cb%14  (my local machine's IPv6
> address)
>
> I still get this annoying error! How can I resolve this?
> See below for my environment
>
> Environment
> windows 7 64 bit
> Spark 1.5.2
> Scala 2.10.6
> Python 2.7.10 (from Anaconda)
>
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> C:\ProgramData\Oracle\Java\javapath
>
> SYSTEM variables set are:
> SPARK_HOME=C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0
>
> \tmp\hive directory at root on C; drive with full permissions,
> e.g.
> >winutils ls \tmp\hive
> drwxrwxrwx 1 BloomBear-SSD\Stefan BloomBear-SSD\None 0 Dec  8 2015
> \tmp\hive
>
>


Re: Not able to receive data in spark from rsyslog

2015-12-07 Thread Akhil Das
Just make sure you are binding on the correct interface.

- java.net.ConnectException: Connection refused​


Means spark was not able to connect to that host/port. You can validate it
by telneting to that host/port.
​


Thanks
Best Regards

On Fri, Dec 4, 2015 at 1:00 PM, masoom alam 
wrote:

> I am getting am error that I am not able receive data in spark streaming
> application from spark.please help with any pointers.
> 9 - java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)
> at java.net.Socket.connect(Socket.java:528)
> at java.net.Socket.(Socket.java:425)
> at java.net.Socket.(Socket.java:208)
> at
> org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:73)
> at
> org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:59)
>
> 15/12/04 02:21:29 INFO ReceiverSupervisorImpl: Stopped receiver 0
>
> However from nc -lk 999 gives the data which is received perfectlyany
> clue...
>
> Thanks
>


Re: How the cores are used in Directstream approach

2015-12-07 Thread Akhil Das
You will have to do a repartition after creating the dstream to utilize all
the cores; createDirectStream gives Spark exactly one partition per Kafka
partition.
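
Roughly like this in Scala (broker, topic and the partition count are
placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-repartition")
val ssc = new StreamingContext(conf, Seconds(10))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("my-topic"))

// The direct stream has exactly one partition per Kafka partition, so
// repartition before the heavy work to spread it across all 4 cores.
stream.repartition(4).foreachRDD(rdd => println(rdd.count()))

ssc.start()
ssc.awaitTermination()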

Thanks
Best Regards

On Thu, Dec 3, 2015 at 9:42 AM, Charan Ganga Phani Adabala <
char...@eiqnetworks.com> wrote:

> Hi,
>
> We have* 1 kafka topic*, by using the direct stream approach in spark we
> have to processing the data present in topic , with one node R cluster
> for to understand how the Spark will behave.
>
> My machine configuration is *4 Cores, 16 GB RAM with 1 executor.*
>
> My question is how many cores are used for this job while running.
>
> *In web console it show 4 cores are used.*
>
> *How the cores are used in Directstream approach*?
>
> Command to run the Job :
>
> *./spark/bin/spark-submit --master spark://XX.XX.XX.XXX:7077 --class
> org.eiq.IndexingClient ~/spark/lib/IndexingClient.jar*
>
>
>
> Thanks & Regards,
>
> *Ganga Phani Charan Adabala | Software Engineer*
>
> o:  +91-40-23116680 | c:  +91-9491418099
>
> EiQ Networks, Inc. 
>
>
>
>
>
> [image: cid:image001.png@01D11C9D.AF5CC1F0] 
>
> *"This email is intended only for the use of the individual or entity
> named above and may contain information that is confidential and
> privileged. If you are not the intended recipient, you are hereby notified
> that any dissemination, distribution or copying of the email is strictly
> prohibited. If you have received this email in error, please destroy
> the original message."*
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark Streaming Shuffle to Disk

2015-12-07 Thread Akhil Das
The updateStateByKey state and your batch data could be filling up the
executor memory, and hence the shuffle might be hitting the disk; you can
verify it by looking at the memory footprint while your job is running.
Looking at the executor logs will also give you a better understanding of
what's going on.

Thanks
Best Regards

On Fri, Dec 4, 2015 at 8:24 AM, Steven Pearson  wrote:

> I'm running a Spark Streaming job on 1.3.1 which contains an
> updateStateByKey.  The job works perfectly fine, but at some point (after a
> few runs), it starts shuffling to disk no matter how much memory I give the
> executors.
>
> I have tried changing --executor-memory on
> spark-submit, spark.shuffle.memoryFraction, spark.storage.memoryFraction,
> and spark.storage.unrollFraction.  But no matter how I configure these, it
> always spills to disk around 2.5GB.
>
> What is the best way to avoid spilling shuffle to disk?
>
>


Re: Predictive Modeling

2015-12-07 Thread Akhil Das
You can write a simple Python script to process the 1.5 GB dataset and use
the pandas library for building your predictive model.

Thanks
Best Regards

On Fri, Dec 4, 2015 at 3:02 PM, Chintan Bhatt <
chintanbhatt...@charusat.ac.in> wrote:

> Hi,
> I'm very much interested to make a predictive model using crime data
> (from 2001-present. It is big .csv file (about 1.5 GB) )in spark on
> hortonworks.
> Can anyone tell me how to start?
>
> --
> CHINTAN BHATT 
> Assistant Professor,
> U & P U Patel Department of Computer Engineering,
> Chandubhai S. Patel Institute of Technology,
> Charotar University of Science And Technology (CHARUSAT),
> Changa-388421, Gujarat, INDIA.
> http://www.charusat.ac.in
> *Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/
>
> [image: IBM]
> 
>
>
> DISCLAIMER: The information transmitted is intended only for the person or
> entity to which it is addressed and may contain confidential and/or
> privileged material which is the intellectual property of Charotar
> University of Science & Technology (CHARUSAT). Any review, retransmission,
> dissemination or other use of, or taking of any action in reliance upon
> this information by persons or entities other than the intended recipient
> is strictly prohibited. If you are not the intended recipient, or the
> employee, or agent responsible for delivering the message to the intended
> recipient and/or if you have received this in error, please contact the
> sender and delete the material from the computer or device. CHARUSAT does
> not take any liability or responsibility for any malicious codes/software
> and/or viruses/Trojan horses that may have been picked up during the
> transmission of this message. By opening and solely relying on the contents
> or part thereof this message, and taking action thereof, the recipient
> relieves the CHARUSAT of all the liabilities including any damages done to
> the recipient's pc/laptop/peripherals and other communication devices due
> to any reason.
>


Re: Spark applications metrics

2015-12-07 Thread Akhil Das
Usually your application is composed of jobs and jobs are composed of stages
and tasks; at the stage/task level you can see how much read/write happened,
from the Stages tab of your driver UI.

Thanks
Best Regards

On Fri, Dec 4, 2015 at 6:20 PM, patcharee  wrote:

> Hi
>
> How can I see the summary of data read / write, shuffle read / write, etc
> of an Application, not per stage?
>
> Thanks,
> Patcharee
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-07 Thread Akhil Das
What's in your SparkIsAwesome class? Just make sure that you are giving
enough partitions to Spark to evenly distribute the job throughout the
cluster.
Try submitting the job this way:

~/spark/bin/spark-submit --executor-cores 10 --executor-memory 5G
--driver-memory 5G --class com.example.SparkIsAwesome awesome/spark.jar


Thanks
Best Regards

On Sat, Dec 5, 2015 at 12:58 AM, Kyohey Hamaguchi 
wrote:

> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> availabile.)
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Akhil Das
Something like this?

val broadcasted = sc.broadcast(...)

RDD2.filter(value => {
  // simply access the broadcast through its .value
  broadcasted.value.contains(value)
})



Thanks
Best Regards

On Fri, Dec 4, 2015 at 10:43 PM, Abhishek Shivkumar <
abhishek.shivku...@bigdatapartnership.com> wrote:

> Hi,
>
>  I have RDD1 that is broadcasted.
>
> I have a user defined method for the filter functionality of RDD2, written
> as follows:
>
> RDD2.filter(my_func)
>
>
> I want to access the values of RDD1 inside my_func. Is that possible?
> Should I pass RDD1 as a parameter into my_func?
>
> Thanks
> Abhishek S
>
> *NOTICE AND DISCLAIMER*
>
> This email (including attachments) is confidential. If you are not the
> intended recipient, notify the sender immediately, delete this email from
> your system and do not disclose or use for any purpose.
>
> Business Address: Eagle House, 163 City Road, London, EC1V 1NR. United
> Kingdom
> Registered Office: Finsgate, 5-7 Cranwood Street, London, EC1V 9EE. United
> Kingdom
> Big Data Partnership Limited is a company registered in England & Wales
> with Company No 7904824
>


Re: AWS CLI --jars comma problem

2015-12-07 Thread Akhil Das
Not a direct answer but you can create a big fat jar combining all the
classes in the three jars and pass it.

Thanks
Best Regards

On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan 
wrote:

> Hello
>
> I have a question about AWS CLI for people who use it.
>
> I create a spark cluster with aws cli and i’m using spark step with jar
> dependencies. But as you can see below i can not set multiple jars because
> AWS CLI replaces comma with space in ARGS.
>
> Is there a way of doing it? I can accept every kind of solutions. For
> example, i tried to merge these two jar dependencies but i could not manage
> it.
>
>
> aws emr create-cluster
> …..
> …..
> Args=[--class,com.blabla.job,
> —jars,"/home/hadoop/first.jar,/home/hadoop/second.jar",
> /home/hadoop/main.jar,--verbose]
>
>
> I also tried to escape comma with \\, but it did not work.
>


Re: Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Akhil Das
Did you read
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support



Thanks
Best Regards

On Fri, Dec 4, 2015 at 4:12 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
>
>
> I am trying to make Hive work with Spark.
>
>
>
> I have been told that I need to use Spark 1.3 and build it from source
> code WITHOUT HIVE libraries.
>
>
>
> I have built it as follows:
>
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
>
>
>
> Now the issue I have that I cannot start master node.
>
>
>
> When I try
>
>
>
> hduser@rhes564::/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin>
> ./start-master.sh
>
> starting org.apache.spark.deploy.master.Master, logging to
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
> failed to launch org.apache.spark.deploy.master.Master:
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
> full log in
> /usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-rhes564.out
>
>
>
> I get
>
>
>
> Spark Command: /usr/java/latest/bin/java -cp
> :/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/sbin/../conf:/usr/lib/spark-1.3.0-bin-hadoop2-without-hive/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/home/hduser/hadoop-2.6.0/etc/hadoop
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.master.Master --ip 50.140.197.217 --port 7077
> --webui-port 8080
>
> 
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
>
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
>
> at java.lang.Class.getMethod0(Class.java:2764)
>
> at java.lang.Class.getMethod(Class.java:1653)
>
> at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>
> at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>
> Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 6 more
>
>
>
> Any advice will be appreciated.
>
>
>
> Thanks,
>
>
>
> Mich
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>

