Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Hi Dhruve, thanks.
I've solved the issue by adding max executors.
I wanted to find some place where I can add this behavior in Spark itself so that
users don't have to worry about the max executors.

Cheers

- Thanks, via mobile,  excuse brevity.
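For reference, a programmatic sketch of the explicit cap; the values are assumptions taken from the 12-node cluster discussed below, not a recommendation from the thread.

{code}
// Hedged sketch: SparkConf equivalent of the spark-submit flags used later in this
// thread. Executor counts are assumptions for a 12-node cluster.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("BoundedDynamicAllocation")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // external shuffle service is required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "12")
  .set("spark.dynamicAllocation.maxExecutors", "500") // the cap that avoided the 168510-executor request

val sc = new SparkContext(conf)
{code}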

On Sep 24, 2016 1:15 PM, "dhruve ashar"  wrote:

> From your log, it's trying to launch every executor with approximately
> 6.6GB of memory. 168510 is an extremely huge no. of executors, and 168510 x
> 6.6GB is unrealistic for a 12-node cluster.
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>
> I don't know the size of the data that you are processing here.
>
> Here are some general choices that I would start with.
>
> Start with a smaller no. of minimum executors and assign them reasonable
> memory. This can be around 48 assuming 12 nodes x 4 cores each. You could
> start with processing a subset of your data and see if you are able to get
> a decent performance. Then gradually increase the maximum # of execs for
> dynamic allocation and process the remaining data.
>
>
>
>
> On Fri, Sep 23, 2016 at 7:54 PM, Yash Sharma  wrote:
>
>> Is there anywhere I can help fix this ?
>>
>> I can see the requests being made in the yarn allocator. What should be
>> the upperlimit of the requests made ?
>>
>> https://github.com/apache/spark/blob/master/yarn/src/main/
>> scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
>>
>> On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma  wrote:
>>
>>> Have been playing around with configs to crack this. Adding them here
>>> where it would be helpful to others :)
>>> Number of executors and timeout seemed like the core issue.
>>>
>>> {code}
>>> --driver-memory 4G \
>>> --conf spark.dynamicAllocation.enabled=true \
>>> --conf spark.dynamicAllocation.maxExecutors=500 \
>>> --conf spark.core.connection.ack.wait.timeout=6000 \
>>> --conf spark.akka.heartbeat.interval=6000 \
>>> --conf spark.akka.frameSize=100 \
>>> --conf spark.akka.timeout=6000 \
>>> {code}
>>>
>>> Cheers !
>>>
>>> On Fri, Sep 23, 2016 at 7:50 PM, 
>>> wrote:
>>>
 For testing purpose can you run with fix number of executors and try.
 May be 12 executors for testing and let know the status.

 Get Outlook for Android 



 On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma"  wrote:

 Thanks Aditya, appreciate the help.
>
> I had the exact thought about the huge number of executors requested.
> I am going with the dynamic executors and not specifying the number of
> executors. Are you suggesting that I should limit the number of executors
> when the dynamic allocator requests for more number of executors.
>
> Its a 12 node EMR cluster and has more than a Tb of memory.
>
>
>
> On Fri, Sep 23, 2016 at 5:12 PM, Aditya  co.in> wrote:
>
>> Hi Yash,
>>
>> What is your total cluster memory and number of cores?
>> Problem might be with the number of executors you are allocating. The
>> logs shows it as 168510 which is on very high side. Try reducing your
>> executors.
>>
>>
>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>
>>> Hi All,
>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>> allocation enabled.
>>> The job takes some 15 minutes to start up and fails as soon as it
>>> starts*.
>>>
>>> Is there anything I can check to debug this problem. There is not a
>>> lot of information in logs for the exact cause but here is some snapshot
>>> below.
>>>
>>> Thanks All.
>>>
>>> * - by starts I mean when it shows something on the spark web ui,
>>> before that its just blank page.
>>>
>>> Logs here -
>>>
>>> {code}
>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total
>>> number of 168510 executor(s).
>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>> containers, each with 2 cores and 6758 MB memory including 614 MB 
>>> overhead
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 22
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 19
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 18
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 12
>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>> for non-existent executor 11

Re: Tuning Spark memory

2016-09-23 Thread Takeshi Yamamuro
Hi,

Currently, the memory fraction of shuffle and storage is automatically
tuned by a memory manager.
So, you do not need to care about this parameter in most cases.
See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L24

// maropu
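
As an illustration, a minimal sketch of tuning the unified region in Spark 2.0; the values are assumptions for illustration, not recommendations.

{code}
// Hedged sketch: instead of the deprecated spark.shuffle.memoryFraction, size the
// unified execution + storage pool via spark.memory.* settings.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MemoryTuningExample")
  .config("spark.memory.fraction", "0.8")        // share of usable heap for execution + storage
  .config("spark.memory.storageFraction", "0.5") // portion of that pool protected from eviction
  .getOrCreate()
{code}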


On Fri, Sep 23, 2016 at 9:06 PM, tan shai  wrote:

> Hi,
>
> I am working with Spark 2.0, the job starts by sorting the input data and
> storing the output on HDFS.
>
> I am getting Out of memory errors, the solution was to increase the value
> of spark.shuffle.memoryFraction from 0.2 to 0.8 and this solves the
> problem. But in the documentation I have found that this is a deprecated
> parameter.
>
> As I have understand, It was replaced by spark.memory.fraction. How to
> modify this parameter while taking into account the sort and storage on
> HDFS?
>
> Thanks.
>



-- 
---
Takeshi Yamamuro


Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Is there anywhere I can help fix this?

I can see the requests being made in the YARN allocator. What should be the
upper limit of the requests made?

https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222

On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma  wrote:

> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, 
> wrote:
>
>> For testing purpose can you run with fix number of executors and try. May
>> be 12 executors for testing and let know the status.
>>
>> Get Outlook for Android 
>>
>>
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" 
>> wrote:
>>
>> Thanks Aditya, appreciate the help.
>>>
>>> I had the exact thought about the huge number of executors requested.
>>> I am going with the dynamic executors and not specifying the number of
>>> executors. Are you suggesting that I should limit the number of executors
>>> when the dynamic allocator requests for more number of executors.
>>>
>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya >> co.in> wrote:
>>>
 Hi Yash,

 What is your total cluster memory and number of cores?
 Problem might be with the number of executors you are allocating. The
 logs shows it as 168510 which is on very high side. Try reducing your
 executors.


 On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:

> Hi All,
> I have a spark job which runs over a huge bulk of data with Dynamic
> allocation enabled.
> The job takes some 15 minutes to start up and fails as soon as it
> starts*.
>
> Is there anything I can check to debug this problem. There is not a
> lot of information in logs for the exact cause but here is some snapshot
> below.
>
> Thanks All.
>
> * - by starts I mean when it shows something on the spark web ui,
> before that its just blank page.
>
> Logs here -
>
> {code}
> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
> thread with (heartbeat : 3000, initial allocation : 200) intervals
> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
> of 168510 executor(s).
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 22
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 19
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 18
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 12
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 11
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 20
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 15
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 7
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 8
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 16
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 21
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 6
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 13
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 14
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 9
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 3
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 17
> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
> non-existent executor 1
> 16/09/23 06:33:36 WARN 

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-23 Thread Yash Sharma
Have been playing around with configs to crack this. Adding them here where
they would be helpful to others :)
The number of executors and the timeout seemed like the core issue.

{code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.core.connection.ack.wait.timeout=6000 \
--conf spark.akka.heartbeat.interval=6000 \
--conf spark.akka.frameSize=100 \
--conf spark.akka.timeout=6000 \
{code}

Cheers !

On Fri, Sep 23, 2016 at 7:50 PM,  wrote:

> For testing purpose can you run with fix number of executors and try. May
> be 12 executors for testing and let know the status.
>
> Get Outlook for Android 
>
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" 
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact thought about the huge number of executors requested.
>> I am going with the dynamic executors and not specifying the number of
>> executors. Are you suggesting that I should limit the number of executors
>> when the dynamic allocator requests for more number of executors.
>>
>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>
>>
>>
>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya > co.in> wrote:
>>
>>> Hi Yash,
>>>
>>> What is your total cluster memory and number of cores?
>>> Problem might be with the number of executors you are allocating. The
>>> logs shows it as 168510 which is on very high side. Try reducing your
>>> executors.
>>>
>>>
>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>
 Hi All,
 I have a spark job which runs over a huge bulk of data with Dynamic
 allocation enabled.
 The job takes some 15 minutes to start up and fails as soon as it
 starts*.

 Is there anything I can check to debug this problem. There is not a lot
 of information in logs for the exact cause but here is some snapshot below.

 Thanks All.

 * - by starts I mean when it shows something on the spark web ui,
 before that its just blank page.

 Logs here -

 {code}
 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
 thread with (heartbeat : 3000, initial allocation : 200) intervals
 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
 of 168510 executor(s).
 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
 containers, each with 2 cores and 6758 MB memory including 614 MB overhead
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 22
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 19
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 18
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 12
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 11
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 20
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 15
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 7
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 8
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 16
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 21
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 6
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 13
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 14
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 9
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 3
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 17
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 1
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 10
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 4
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 2
 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
 non-existent executor 5
 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
 time(s) in a row.

Error in run multiple unit test that extends DataFrameSuiteBase

2016-09-23 Thread Jinyuan Zhou
After I created two test cases that extend FlatSpec with DataFrameSuiteBase, I
got errors when running sbt test. I was able to run each of them separately. My
test cases do use sqlContext to read files. Here is the exception stack.
Judging from the exception, I may need to unregister the RpcEndpoint after each
test run.
[info] Exception encountered when attempting to run a suite with class name:
MyTestSuit *** ABORTED ***
[info]   java.lang.IllegalArgumentException: There is already an
RpcEndpoint called LocalSchedulerBackendEndpoint
[info]   at
org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:66)
[info]   at
org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:129)
[info]   at
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:127)
[info]   at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
[info]   at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
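
A possible cause, offered here as an assumption: sbt runs test suites in parallel by default, so both suites try to start a local SparkContext and register the same endpoint. A hedged build.sbt sketch that serializes the suites:

{code}
// Hedged build.sbt sketch: run test suites one at a time so only a single local
// SparkContext (and its LocalSchedulerBackendEndpoint) exists at any moment.
parallelExecution in Test := false

// Optionally fork a separate JVM for the tests to keep their state isolated.
fork in Test := true
{code}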


Re: With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Yong Zhang
df.write.format(source).mode("overwrite").save(path)


Yong
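
Applied to the snippet from the question, a minimal sketch might look like this (SaveMode.Overwrite is the typed equivalent of the "overwrite" string):

{code}
// Hedged sketch reusing the spark-csv example from the question; "newcars.csv" is
// the example output folder being overwritten.
import org.apache.spark.sql.SaveMode

selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite) // replaces the existing folder instead of failing
  .save("newcars.csv")
{code}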



From: Dan Bikle 
Sent: Friday, September 23, 2016 6:45 PM
To: user@spark.apache.org
Subject: With spark DataFrame, how to write to existing folder?

spark-world,

I am walking through the example here:

https://github.com/databricks/spark-csv#scala-api

The example complains if I try to write a DataFrame to an existing folder:

val selectedData = df.select("year", "model")
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv")

I used google to look for DataFrame.write() API.

It sent me here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-data-readerwriter-interface

There I found this link:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame@write:DataFrameWriter

And that link is a 404-error.

Question:
How to enhance this call so it overwrites instead of failing:

selectedData.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv")
??


With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Dan Bikle
spark-world,

I am walking through the example here:

https://github.com/databricks/spark-csv#scala-api

The example complains if I try to write a DataFrame to an existing folder:





val selectedData = df.select("year", "model")
selectedData.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("newcars.csv")

I used google to look for DataFrame.write() API.

It sent me here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-data-readerwriter-interface

There I found this link:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame@write:DataFrameWriter

And that link is a 404-error.

Question:
How to enhance this call so it overwrites instead of failing:





selectedData.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("newcars.csv")
??


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ayan guha
You may try copying the file to the same location on all nodes and reading it
from that place.
On 24 Sep 2016 00:20, "ABHISHEK"  wrote:

> I have tried with hdfs/tmp location but it didn't work. Same error.
>
> On 23 Sep 2016 19:37, "Aditya"  wrote:
>
>> Hi Abhishek,
>>
>> Try below spark submit.
>> spark-submit --master yarn --deploy-mode cluster  --files hdfs://
>> abc.com:8020/tmp/abc.drl --class com.abc.StartMain
>> abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl
>> 
>>
>> On Friday 23 September 2016 07:29 PM, ABHISHEK wrote:
>>
>> Thanks for your response Aditya and Steve.
>> Steve:
>> I have tried specifying both /tmp/filename in hdfs and local path but it
>> didn't work.
>> You may be write that Kie session is configured  to  access files from
>> Local path.
>> I have attached code here for your reference and if you find some thing
>> wrong, please help to correct it.
>>
>> Aditya:
>> I have attached code here for reference. --File option will distributed
>> reference file to all node but  Kie session is not able  to pickup it.
>>
>> Thanks,
>> Abhishek
>>
>> On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
>> wrote:
>>
>>>
>>> On 23 Sep 2016, at 08:33, ABHISHEK  wrote:
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhiet
>>> c/abc.drl (No such file or directory)
>>> at java.io.FileInputStream.open(Native Method)
>>> at java.io.FileInputStream.(FileInputStream.java:146)
>>> at org.drools.core.io.impl.FileSystemResource.getInputStream(Fi
>>> leSystemResource.java:123)
>>> at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write
>>> (KieFileSystemImpl.java:58)
>>>
>>>
>>>
>>> Looks like this .KieFileSystemImpl class only works with local files, so
>>> when it gets an HDFS path in it tries to open it and gets confused.
>>>
>>> you may need to write to a local FS temp file then copy it into HDFS
>>>
>>
>>
>>
>>


Running Spark master/slave instances in non Daemon mode

2016-09-23 Thread Jeff Puro
Hi,

I recently tried deploying Spark master and slave instances to container
based environments such as Docker, Nomad etc. There are two issues that
I've found with how the startup scripts work. The sbin/start-master.sh and
sbin/start-slave.sh start a daemon by default, but this isn't as compatible
with container deployments as one would think. The first issue is that the
daemon runs in the background and some container solutions require the apps
to run in the foreground or they consider the application to not be running
and they may close down the task. The second issue is that logs don't seem
to get integrated with the logging mechanism in the container solution.
What is the possibility of adding additional flags or startup scripts for
supporting Spark to run in the foreground? It would be great if a flag
like SPARK_NO_DAEMONIZE could be added or another script for foreground
execution.

Regards,

Jeff
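
Until such a flag exists, one possible workaround (an assumption, not an official feature) is to bypass the daemon scripts and launch the master/worker classes directly with spark-class, which stays in the foreground and logs to the console. Hostnames and ports below are placeholders.

{code}
# Hedged sketch: run the master and a worker in the foreground for container use.
./bin/spark-class org.apache.spark.deploy.master.Master \
  --host master.example.com --port 7077 --webui-port 8080

./bin/spark-class org.apache.spark.deploy.worker.Worker \
  spark://master.example.com:7077
{code}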


Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mark Hamstra
>
> The best network results are achieved when Spark nodes share the same
> hosts as Hadoop or they happen to be on the same subnet.
>

That's only true for those portions of a Spark execution pipeline that are
actually reading from HDFS.  If you're re-using an RDD for which the needed
shuffle files are already available on Executor nodes or are looking at
stages of a Spark SQL query execution later than those reading from HDFS,
then data locality and network utilization concerns don't really have
anything to do with co-location of Executors and HDFS data nodes.

On Fri, Sep 23, 2016 at 1:31 PM, Mich Talebzadeh 
wrote:

> Does this assume that Spark is running on the same hosts as HDFS? Hence
> does increasing the latency affects the network latency on Hadoop nodes as
> well in your tests?
>
> The best network results are achieved when Spark nodes share the same
> hosts as Hadoop or they happen to be on the same subnet.
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 22 September 2016 at 14:54, gusiri  wrote:
>
>> Hi,
>>
>> When I increase the network latency among spark nodes,
>>
>> I see compute time (=executor computing time in Spark Web UI) also
>> increases.
>>
>> In the graph attached, left = latency 1ms vs right = latency 500ms.
>>
>> Is there any communication between worker and driver/master even 'during'
>> executor computing? or any idea on this result?
>>
>>
>> > n27779/Screen_Shot_2016-09-21_at_5.png>
>>
>>
>>
>>
>>
>> Thank you very much in advance.
>>
>> //gusiri
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Is-executor-computing-time-affected-
>> by-network-latency-tp27779.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Kevin Mellott
You can run Spark code using the command line or by creating a JAR file
(via IntelliJ or other IDE); however, you may wish to try a Databricks
Community Edition account instead. They offer Spark as a managed service,
and you can run Spark commands one at a time via interactive notebooks.
There are built-in visualization tools, with the ability to integrate to
3rd party ones if you wish.

This type of development is very similar to iPython, but they provide a 6GB
cluster with their free accounts. They also provide many example notebooks
to help you learn various aspects of Spark.

https://databricks.com/try-databricks

Thanks,
Kevin

On Fri, Sep 23, 2016 at 2:37 PM, Dan Bikle  wrote:

> hello spark-world,
>
> I am new to spark and want to learn how to use it.
>
> I come from the Python world.
>
> I see an example at the url below:
>
> http://spark.apache.org/docs/latest/ml-pipeline.html#
> example-estimator-transformer-and-param
>
> What would be an optimal way to run the above example?
>
> In the Python world I would just feed the name of the script to Python on
> the command line.
>
> In the spark-world would people just start spark-shell and use a mouse to
> feed in the syntax?
>
> Perhaps people would follow the example here which uses a combo of sbt and
> spark-submit:
>
> http://spark.apache.org/docs/latest/quick-start.html#self-
> contained-applications
>
> ??
>
> Perhaps people usually have a Java-mindset and use an IDE built for
> spark-development?
> If so, which would be considered the best IDE for Spark? IntelliJ?
>
>


Re: databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Holden Karau
So the good news is the CSV library has been integrated into Spark 2.0, so
you don't need to use that package. On the other hand, if you're on an older
version you can include it using the standard sbt or Maven package
configuration.
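
For example, on a pre-2.0 Spark a single sbt line is enough; the coordinates come from the README quoted below, and a Scala 2.11 build is assumed:

{code}
// build.sbt sketch for Spark 1.x (Scala 2.11 assumed):
libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"

// In Spark 2.0+ no extra package is needed; the reader is built in, e.g.:
//   spark.read.option("header", "true").csv("cars.csv")
{code}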

On Friday, September 23, 2016, Dan Bikle  wrote:

> hello world-of-spark,
>
> I am learning spark today.
>
> I want to understand the spark code in this repo:
>
> https://github.com/databricks/spark-csv
>
> In the README.md I see this info:
>
> Linking
>
> You can link against this library in your program at the following
> coordinates:
> Scala 2.10
>
> groupId: com.databricks
> artifactId: spark-csv_2.10
> version: 1.5.0
>
> Scala 2.11
>
> groupId: com.databricks
> artifactId: spark-csv_2.11
> version: 1.5.0
>
> I want to know how I can use the above info.
>
> The people who wrote spark-csv should give some kind of example, demo, or
> context.
>
> My understanding of Linking is limited.
>
> I have some experience operating sbt which I learned from this URL:
>
> http://spark.apache.org/docs/latest/quick-start.html#self-
> contained-applications
>
> The above URL does not give me enough information so that I can link
> spark-csv with spark.
>
> Question:
> How do I learn how to use the info in the Linking section of the README.md
> of
> https://github.com/databricks/spark-csv
> ??
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Is executor computing time affected by network latency?

2016-09-23 Thread Mich Talebzadeh
Does this assume that Spark is running on the same hosts as HDFS? Hence
does increasing the latency affect the network latency on Hadoop nodes as
well in your tests?

The best network results are achieved when Spark nodes share the same hosts
as Hadoop or they happen to be on the same subnet.


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 22 September 2016 at 14:54, gusiri  wrote:

> Hi,
>
> When I increase the network latency among spark nodes,
>
> I see compute time (=executor computing time in Spark Web UI) also
> increases.
>
> In the graph attached, left = latency 1ms vs right = latency 500ms.
>
> Is there any communication between worker and driver/master even 'during'
> executor computing? or any idea on this result?
>
>
>  file/n27779/Screen_Shot_2016-09-21_at_5.png>
>
>
>
>
>
> Thank you very much in advance.
>
> //gusiri
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Is-executor-computing-time-
> affected-by-network-latency-tp27779.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


databricks spark-csv: linking coordinates are what?

2016-09-23 Thread Dan Bikle
hello world-of-spark,

I am learning spark today.

I want to understand the spark code in this repo:

https://github.com/databricks/spark-csv

In the README.md I see this info:

Linking

You can link against this library in your program at the following
coordinates:
Scala 2.10

groupId: com.databricks
artifactId: spark-csv_2.10
version: 1.5.0

Scala 2.11

groupId: com.databricks
artifactId: spark-csv_2.11
version: 1.5.0

I want to know how I can use the above info.

The people who wrote spark-csv should give some kind of example, demo, or
context.

My understanding of Linking is limited.

I have some experience operating sbt which I learned from this URL:

http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications

The above URL does not give me enough information so that I can link
spark-csv with spark.

Question:
How do I learn how to use the info in the Linking section of the README.md
of
https://github.com/databricks/spark-csv
??


Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
See the reference on shuffles
,
"Spark’s mechanism for re-distributing data so that it’s grouped
differently across partitions. This typically involves copying data across
executors and machines, making the shuffle a complex and costly operation."



On Thu, Sep 22, 2016 at 4:14 PM, Soumitra Johri <
soumitra.siddha...@gmail.com> wrote:

> If your job involves a shuffle then the compute for the entire batch will
> increase with network latency. What would be interesting is to see how much
> time each task/job/stage takes.
>
> On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi 
> wrote:
>
>> It seems to me they must communicate for joins, sorts, grouping, and so
>> forth, where the original data partitioning needs to change.  You could
>> repeat your experiment for different code snippets.  I'll bet it depends on
>> what you do.
>>
>> On Thu, Sep 22, 2016 at 8:54 AM, gusiri  wrote:
>>
>>> Hi,
>>>
>>> When I increase the network latency among spark nodes,
>>>
>>> I see compute time (=executor computing time in Spark Web UI) also
>>> increases.
>>>
>>> In the graph attached, left = latency 1ms vs right = latency 500ms.
>>>
>>> Is there any communication between worker and driver/master even 'during'
>>> executor computing? or any idea on this result?
>>>
>>>
>>> >> file/n27779/Screen_Shot_2016-09-21_at_5.png>
>>>
>>>
>>>
>>>
>>>
>>> Thank you very much in advance.
>>>
>>> //gusiri
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Is-executor-computing-time-
>>> affected-by-network-latency-tp27779.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: Open source Spark based projects

2016-09-23 Thread manasdebashiskar
Check out Spark Packages at https://spark-packages.org/ and you will find a few
awesome and a lot of super awesome projects.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-Spark-based-projects-tp27778p27788.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Dan Bikle
hello spark-world,

I am new to spark and want to learn how to use it.

I come from the Python world.

I see an example at the url below:

http://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param

What would be an optimal way to run the above example?

In the Python world I would just feed the name of the script to Python on
the command line.

In the spark-world would people just start spark-shell and use a mouse to
feed in the syntax?

Perhaps people would follow the example here which uses a combo of sbt and
spark-submit:

http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications

??

Perhaps people usually have a Java-mindset and use an IDE built for
spark-development?
If so, which would be considered the best IDE for Spark? IntelliJ?


Can somebody remove this guy?

2016-09-23 Thread Dirceu Semighini Filho
Can somebody remove this guy from the list?
tod...@yahoo-inc.com
I just sent a message to the list and received a mail from Yahoo saying that
this email address doesn't exist anymore.

This is an automatically generated message.

tod...@yahoo-inc.com is no longer with Yahoo! Inc.

Your message will not be forwarded.

If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and
someone will follow up with you shortly.

If you require assistance with a legal matter, please send a message to
legal-noti...@yahoo-inc.com

Thank you!

Regards,
Dirceu


Re: 答复: 答复: it does not stop at breakpoints which is in an anonymous function

2016-09-23 Thread Dirceu Semighini Filho
Hi Felix,
I just ran your code and it prints:

Pi is roughly 4.0

Here is the code that I used; since you didn't show what random is, I used
nextInt():

val n = math.min(10L * slices, Int.MaxValue).toInt // avoid overflow
val count = context.sparkContext.parallelize(1 until n, slices).map { i =>
  val random = new scala.util.Random(1000).nextInt()
  val x = random * 2 - 1  //(breakpoint-1)
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / (n - 1))
context.sparkContext.stop()

Also using the debug it stops into the map (breakpoint 1) before going to
print

2016-09-18 6:47 GMT-03:00 chen yong :

> Dear Dirceu,
>
>
> Below is  our testing codes, as you can see, we have used "reduce" action
> to evoke evaluation. However, it still did not stop at breakpoint-1(as
> shown in the the code snippet) when debugging.
>
>
>
> We are using IDEA  version 14.0.3 to debug.  It very very strange to us.
> Please help us(me and my colleagues).
>
>
> // scalastyle:off println
> package org.apache.spark.examples
> import scala.math.random
> import org.apache.spark._
> import scala.util.logging.Logged
>
> /** Computes an approximation to pi */
> object SparkPi{
>   def main(args: Array[String]) {
>
> val conf = new SparkConf().setAppName("Spark Pi").setMaster("local")
> val spark = new SparkContext(conf)
> val slices = if (args.length > 0) args(0).toInt else 2
> val n = math.min(10L * slices, Int.MaxValue).toInt // avoid
> overflow
> val count = spark.parallelize(1 until n, slices).map { i =>
> val x = random * 2 - 1  // (breakpoint-1)
> val y = random * 2 - 1
> if (x*x + y*y < 1) 1 else 0
>   }.reduce(_ + _)
> println("Pi is roughly " + 4.0 * count / (n - 1))
> spark.stop()
>   }
> }
>
>
>
>
> --
> *发件人:* Dirceu Semighini Filho 
> *发送时间:* 2016年9月16日 22:27
> *收件人:* chen yong
> *抄送:* user@spark.apache.org
> *主题:* Re: 答复: it does not stop at breakpoints which is in an anonymous
> function
>
> No, that's not the right way of doing it.
> Remember that RDD operations are lazy, due to performance reasons.
> Whenever you call one of those operation methods (count, reduce, collect,
> ...) they will execute all the functions that you have done to create that
> RDD.
> It would help if you can post your code here, and also the way that you
> are executing it, and trying to debug.
>
>
> 2016-09-16 11:23 GMT-03:00 chen yong :
>
>> Also, I wonder what is the right way to debug  spark program. If I use
>> ten anonymous function in one spark program, for debugging each of them, i
>> have to place a COUNT action in advace and then remove it after debugging.
>> Is that the right way?
>>
>>
>> --
>> *发件人:* Dirceu Semighini Filho 
>> *发送时间:* 2016年9月16日 21:07
>> *收件人:* chen yong
>> *抄送:* user@spark.apache.org
>> *主题:* Re: 答复: 答复: 答复: 答复: t it does not stop at breakpoints which is in
>> an anonymous function
>>
>> Hello Felix,
>> No, this line isn't the one that is triggering the execution of the
>> function, the count does that, unless your count val is a lazy val.
>> The count method is the one that retrieves the information of the rdd, it
>> has do go through all of it's data do determine how many records the RDD
>> has.
>>
>> Regards,
>>
>> 2016-09-15 22:23 GMT-03:00 chen yong :
>>
>>>
>>> Dear Dirceu,
>>>
>>> Thanks for your kind help.
>>> i cannot see any code line corresponding to ". retrieve the data
>>> from your DataFrame/RDDs". which you suggested in the previous replies.
>>>
>>> Later, I guess
>>>
>>> the line
>>>
>>> val test = count
>>>
>>> is the key point. without it, it would not stop at the breakpont-1,
>>> right?
>>>
>>>
>>>
>>> --
>>> *发件人:* Dirceu Semighini Filho 
>>> *发送时间:* 2016年9月16日 0:39
>>> *收件人:* chen yong
>>> *抄送:* user@spark.apache.org
>>> *主题:* Re: 答复: 答复: 答复: t it does not stop at breakpoints which is in an
>>> anonymous function
>>>
>>> Hi Felix,
>>> Are sure your n is greater than 0?
>>> Here it stops first at breakpoint 1, image attached.
>>> Have you got the count to see if it's also greater than 0?
>>>
>>> 2016-09-15 11:41 GMT-03:00 chen yong :
>>>
 Dear Dirceu


 Thank you for your help.


 Acutally, I use Intellij IDEA to dubug the spark code.


 Let me use the following code snippet to illustrate my problem. In the
 code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2.
 when i debuged the code, it did not stop at breakpoint-1, it seems
 that the map

 function was skipped and it directly reached and stoped at the
 breakpoint-2.

 Additionally, I find the following two posts
 

Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread sagarcasual .
Hi, thanks for the response.
The issue I am facing is only with the clustered Kafka (2.11-based, version
0.10.0.1) and Spark 1.6.1, with the following dependencies:
org.apache.spark:spark-core_2.10:1.6.1
compile group: 'org.apache.spark', name: 'spark-streaming_2.10',
version:'1.6.1'
compile group: 'org.apache.spark', name: 'spark-streaming-kafka_2.10',
version:'1.6.1'

For standalone Kafka, with the rest of the versions remaining the same, I am
perfectly able to connect and consume from Kafka using Spark Streaming.

Do you know if we have to do anything special when doing
createDirectStream/RDD with the clustered Kafka 0.10 consumed via Spark
1.6.1?

-Regards

Sagar
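
For reference, a minimal sketch of the direct stream setup in question; the broker list, topic name, and ssc are placeholders, and whether listing every broker of the 0.10 cluster helps here is an assumption, not a confirmed fix.

{code}
// Hedged sketch of the 0.8-style direct stream used with Spark 1.6.1.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092,broker2:9092,broker3:9092", // placeholders
  "auto.offset.reset"    -> "smallest")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("myTopic"))
{code}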



On Fri, Sep 23, 2016 at 7:05 AM, Cody Koeninger  wrote:

> For Spark 2.0 there are two kafka artifacts,
> spark-streaming-kafka-0-10 (0.10 and later brokers only) and
> spark-streaming-kafka-0-8 (should work with 0.8 and later brokers).
> The docs explaining this were merged to master just after 2.0
> released, so they haven't been published yet.
>
> There are usage examples of the 0-10 connector at
> https://github.com/koeninger/kafka-exactly-once
>
> The change to Spark 2.0 was really straightforward for the one or two
> jobs I switched over, for what it's worth.
>
> On Fri, Sep 23, 2016 at 12:31 AM, sagarcasual . 
> wrote:
> > Also you mentioned about streaming-kafka-0-10 connector, what connector
> is
> > this, do you know the dependency ? I did not see mention of it in the
> > documents
> > For current Spark 1.6.1 to Kafka 0.10.0.1 standalone, the only
> dependencies
> > I have are
> >
> > org.apache.spark:spark-core_2.10:1.6.1
> > compile group: 'org.apache.spark', name: 'spark-streaming_2.10',
> > version:'1.6.1'
> > compile group: 'org.apache.spark', name: 'spark-streaming-kafka_2.10',
> > version:'1.6.1'
> > compile group: 'org.apache.spark', name: 'spark-sql_2.10', version:
> '1.6.1'
> >
> > For Spark 2.0 with Kafka 0.10.0.1 do I need to have a different kafka
> > connector dependency?
> >
> >
> > On Thu, Sep 22, 2016 at 2:21 PM, sagarcasual . 
> > wrote:
> >>
> >> Hi Cody,
> >> Thanks for the response.
> >> One thing I forgot to mention is I am using a Direct Approach (No
> >> receivers) in Spark streaming.
> >>
> >> I am not sure if I have that leverage to upgrade at this point, but do
> you
> >> know if Spark 1.6.1 to Spark 2.0 jump is smooth usually or does it
> involve
> >> lot of hick-ups.
> >> Also is there a migration guide or something?
> >>
> >> -Regards
> >> Sagar
> >>
> >> On Thu, Sep 22, 2016 at 1:39 PM, Cody Koeninger 
> >> wrote:
> >>>
> >>> Do you have the ability to try using Spark 2.0 with the
> >>> streaming-kafka-0-10 connector?
> >>>
> >>> I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it
> >>> would be good to rule that out.
> >>>
> >>> On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . 
> >>> wrote:
> >>> > Hello,
> >>> >
> >>> > I am trying to stream data out of kafka cluster (2.11_0.10.0.1) using
> >>> > Spark
> >>> > 1.6.1
> >>> > I am receiving following error, and I confirmed that Topic to which I
> >>> > am
> >>> > trying to connect exists with the data .
> >>> >
> >>> > Any idea what could be the case?
> >>> >
> >>> > kafka.common.UnknownTopicOrPartitionException
> >>> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >>> > Method)
> >>> > at
> >>> >
> >>> > sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> >>> > at
> >>> >
> >>> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> >>> > at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> >>> > at java.lang.Class.newInstance(Class.java:442)
> >>> > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:102)
> >>> > at
> >>> >
> >>> > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.
> handleFetchErr(KafkaRDD.scala:184)
> >>> > at
> >>> >
> >>> > org.apache.spark.streaming.kafka.KafkaRDD$
> KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
> >>> > at
> >>> >
> >>> > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(
> KafkaRDD.scala:208)
> >>> > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> >>> > at
> >>> >
> >>> > scala.collection.convert.Wrappers$IteratorWrapper.
> hasNext(Wrappers.scala:29)
> >>> >
> >>> >
> >>
> >>
> >
> >
>


Spark MLlib ALS algorithm

2016-09-23 Thread Roshani Nagmote
Hello,

I was working on Spark MLlib ALS Matrix factorization algorithm and came
across the following blog post:

https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html

Can anyone help me understand what the "s" scaling factor does and whether it
really gives better performance? What's the significance of this?
If we convert the input data to scaledData with the help of "s", will it
speed up the algorithm?

Scaled data usage:
*(For each user, we create pseudo-users that have the same ratings. That
is, for every rating as (userId, productId, rating), we generate (userId+i,
productId, rating) where 0 <= i < s and s is the scaling factor)*

Also, this blogpost is for spark 1.1 and I am currently using 2.0

Any help will be greatly appreciated.

Thanks,
Roshani


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
I have tried with hdfs/tmp location but it didn't work. Same error.

On 23 Sep 2016 19:37, "Aditya"  wrote:

> Hi Abhishek,
>
> Try below spark submit.
> spark-submit --master yarn --deploy-mode cluster  --files hdfs://
> abc.com:8020/tmp/abc.drl --class com.abc.StartMain
> abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl
> 
>
> On Friday 23 September 2016 07:29 PM, ABHISHEK wrote:
>
> Thanks for your response Aditya and Steve.
> Steve:
> I have tried specifying both /tmp/filename in hdfs and local path but it
> didn't work.
> You may be write that Kie session is configured  to  access files from
> Local path.
> I have attached code here for your reference and if you find some thing
> wrong, please help to correct it.
>
> Aditya:
> I have attached code here for reference. --File option will distributed
> reference file to all node but  Kie session is not able  to pickup it.
>
> Thanks,
> Abhishek
>
> On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
> wrote:
>
>>
>> On 23 Sep 2016, at 08:33, ABHISHEK  wrote:
>>
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/abhiet
>> c/abc.drl (No such file or directory)
>> at java.io.FileInputStream.open(Native Method)
>> at java.io.FileInputStream.(FileInputStream.java:146)
>> at org.drools.core.io.impl.FileSystemResource.getInputStream(Fi
>> leSystemResource.java:123)
>> at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.
>> write(KieFileSystemImpl.java:58)
>>
>>
>>
>> Looks like this .KieFileSystemImpl class only works with local files, so
>> when it gets an HDFS path in it tries to open it and gets confused.
>>
>> you may need to write to a local FS temp file then copy it into HDFS
>>
>
>
>
>


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya

Hi Abhishek,

Try below spark submit.
spark-submit --master yarn --deploy-mode cluster --files hdfs://abc.com:8020/tmp/abc.drl \
  --class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar abc.drl


On Friday 23 September 2016 07:29 PM, ABHISHEK wrote:

Thanks for your response Aditya and Steve.
Steve:
I have tried specifying both /tmp/filename in hdfs and local path but 
it didn't work.
You may be write that Kie session is configured  to  access files from 
Local path.
I have attached code here for your reference and if you find some 
thing wrong, please help to correct it.


Aditya:
I have attached code here for reference. --File option will 
distributed reference file to all node but  Kie session is not able 
 to pickup it.


Thanks,
Abhishek

On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
> wrote:




On 23 Sep 2016, at 08:33, ABHISHEK > wrote:

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException:
hdfs:/abc.com:8020/user/abhietc/abc.drl
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at

org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at

org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)



Looks like this .KieFileSystemImpl class only works with local
files, so when it gets an HDFS path in it tries to open it and
gets confused.

you may need to write to a local FS temp file then copy it into HDFS








Re: Error while Spark 1.6.1 streaming from Kafka-2.11_0.10.0.1 cluster

2016-09-23 Thread Cody Koeninger
For Spark 2.0 there are two kafka artifacts,
spark-streaming-kafka-0-10 (0.10 and later brokers only) and
spark-streaming-kafka-0-8 (should work with 0.8 and later brokers).
The docs explaining this were merged to master just after 2.0
released, so they haven't been published yet.

There are usage examples of the 0-10 connector at
https://github.com/koeninger/kafka-exactly-once

The change to Spark 2.0 was really straightforward for the one or two
jobs I switched over, for what it's worth.
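
For reference, a build.sbt sketch of the two artifacts mentioned above; the 2.0.0 version string is an assumption:

{code}
// Hedged sketch: pick exactly one of the two connectors when moving to Spark 2.0.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
// or, if 0.8-era brokers must still be supported:
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
{code}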

On Fri, Sep 23, 2016 at 12:31 AM, sagarcasual .  wrote:
> Also you mentioned about streaming-kafka-0-10 connector, what connector is
> this, do you know the dependency ? I did not see mention of it in the
> documents
> For current Spark 1.6.1 to Kafka 0.10.0.1 standalone, the only dependencies
> I have are
>
> org.apache.spark:spark-core_2.10:1.6.1
> compile group: 'org.apache.spark', name: 'spark-streaming_2.10',
> version:'1.6.1'
> compile group: 'org.apache.spark', name: 'spark-streaming-kafka_2.10',
> version:'1.6.1'
> compile group: 'org.apache.spark', name: 'spark-sql_2.10', version: '1.6.1'
>
> For Spark 2.0 with Kafka 0.10.0.1 do I need to have a different kafka
> connector dependency?
>
>
> On Thu, Sep 22, 2016 at 2:21 PM, sagarcasual . 
> wrote:
>>
>> Hi Cody,
>> Thanks for the response.
>> One thing I forgot to mention is I am using a Direct Approach (No
>> receivers) in Spark streaming.
>>
>> I am not sure if I have that leverage to upgrade at this point, but do you
>> know if Spark 1.6.1 to Spark 2.0 jump is smooth usually or does it involve
>> lot of hick-ups.
>> Also is there a migration guide or something?
>>
>> -Regards
>> Sagar
>>
>> On Thu, Sep 22, 2016 at 1:39 PM, Cody Koeninger 
>> wrote:
>>>
>>> Do you have the ability to try using Spark 2.0 with the
>>> streaming-kafka-0-10 connector?
>>>
>>> I'd expect the 1.6.1 version to be compatible with kafka 0.10, but it
>>> would be good to rule that out.
>>>
>>> On Thu, Sep 22, 2016 at 1:37 PM, sagarcasual . 
>>> wrote:
>>> > Hello,
>>> >
>>> > I am trying to stream data out of kafka cluster (2.11_0.10.0.1) using
>>> > Spark
>>> > 1.6.1
>>> > I am receiving following error, and I confirmed that Topic to which I
>>> > am
>>> > trying to connect exists with the data .
>>> >
>>> > Any idea what could be the case?
>>> >
>>> > kafka.common.UnknownTopicOrPartitionException
>>> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> > Method)
>>> > at
>>> >
>>> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>> > at
>>> >
>>> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> > at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>> > at java.lang.Class.newInstance(Class.java:442)
>>> > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:102)
>>> > at
>>> >
>>> > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
>>> > at
>>> >
>>> > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
>>> > at
>>> >
>>> > org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
>>> > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>>> > at
>>> >
>>> > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>>> >
>>> >
>>
>>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Thanks for your response Aditya and Steve.
Steve:
I have tried specifying both /tmp/filename in HDFS and a local path, but it
didn't work.
You may be right that the Kie session is configured to access files from a
local path.
I have attached the code here for your reference; if you find something
wrong, please help to correct it.

Aditya:
I have attached the code here for reference. The --files option will distribute the
reference file to all nodes, but the Kie session is not able to pick it up.

Thanks,
Abhishek

On Fri, Sep 23, 2016 at 2:25 PM, Steve Loughran 
wrote:

>
> On 23 Sep 2016, at 08:33, ABHISHEK  wrote:
>
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: hdfs:/abc.com:8020/user/
> abhietc/abc.drl (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at org.drools.core.io.impl.FileSystemResource.getInputStream(
> FileSystemResource.java:123)
> at org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(
> KieFileSystemImpl.java:58)
>
>
>
> Looks like this .KieFileSystemImpl class only works with local files, so
> when it gets an HDFS path in it tries to open it and gets confused.
>
> you may need to write to a local FS temp file then copy it into HDFS
>
object Mymain {
  def main(args: Array[String]): Unit = {

    // @abhishek
    // val fileName = "./abc.drl"  // code works if I run the app in local mode
    val fileName = args(0)
val conf = new SparkConf().setAppName("LocalStreaming")
val sc = new SparkContext(conf) 
val ssc = new StreamingContext(sc, Seconds(1)) 
val brokers = "190.51.231.132:9092"
val groupId = "testgroup"
val offsetReset = "smallest"
val pollTimeout = "1000"
val topics = "NorthPole"
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
  ConsumerConfig.GROUP_ID_CONFIG -> groupId,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> 
"org.apache.kafka.common.serialization.StringDeserializer",
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> 
"org.apache.kafka.common.serialization.StringDeserializer",
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> offsetReset,
  ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
  "spark.kafka.poll.time" -> pollTimeout)

val detailTable = "emp1 "
val summaryTable= "emp2"
val confHBase = HBaseConfiguration.create()
confHBase.set("hbase.zookeeper.quorum", "190.51.231.132")
confHBase.set("hbase.zookeeper.property.clientPort", "2181")
confHBase.set("hbase.master", "190.51.231.132:6")
val emp_detail_config = Job.getInstance(confHBase)
val emp_summary_config = Job.getInstance(confHBase)
 
emp_detail_config.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, 
"emp_detail");

emp_detail_config.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  
emp_summary_config.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, 
"emp_summary")

emp_summary_config.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, topicsSet)

messages.foreachRDD(rdd => {
  if (rdd.count > 0) {

val messageRDD: RDD[String] = rdd.map { y => y._2 }
val inputJsonObjectRDD = messageRDD.map(row => 
Utilities.convertToJsonObject(row))

inputJsonObjectRDD.map(row => 
BuildJsonStrctures.buildTaxCalcDetail(row)).saveAsNewAPIHadoopDataset(emp_detail_config.getConfiguration)

val inputObjectRDD = messageRDD.map(row => 
Utilities.convertToSubmissionJavaObject(row))
val executedObjectRDD = inputObjectRDD.mapPartitions(row => 
KieSessionFactory.execute(row, fileName.toString()))
val executedJsonRDD = executedObjectRDD.map(row => 
Utilities.convertToSubmissionJSonString(row))
  .map(row => Utilities.convertToJsonObject(row))

val summaryInputJsonRDD = executedObjectRDD

summaryInputJsonRDD.map(row => 
BuildJsonStrctures.buildSummary2(row)).saveAsNewAPIHadoopDataset(emp_summary_config.getConfiguration)

  } else {
println("No message received") //this works only in master local mode
  }
})
ssc.start()
ssc.awaitTermination()

  }
}
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.

2016-09-23 Thread muhammet pakyürek

I tried to connect to Cassandra via spark-cassandra-connector 2.0.0 on PySpark but
I get the error below.

I think it's related to pyspark/context.py but I don't know how.


Tuning Spark memory

2016-09-23 Thread tan shai
Hi,

I am working with Spark 2.0, the job starts by sorting the input data and
storing the output on HDFS.

I am getting Out of memory errors, the solution was to increase the value
of spark.shuffle.memoryFraction from 0.2 to 0.8 and this solves the
problem. But in the documentation I have found that this is a deprecated
parameter.

As I have understand, It was replaced by spark.memory.fraction. How to
modify this parameter while taking into account the sort and storage on
HDFS?

Thanks.


Re: Apache Spark JavaRDD pipe() need help

2016-09-23 Thread शशिकांत कुलकर्णी
Thank you Jakob. I will try as suggested.

Regards,
Shashi
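
For anyone following along, a minimal sketch of the pipe approach Jakob describes below; productsRDD, the Product fields, and the binary path /usr/local/bin/calc are placeholders based on the thread, not a real API.

{code}
// Hedged sketch of RDD.pipe: each partition's lines go to the external process's
// stdin, and its stdout lines become the output RDD. Paths and fields are assumptions.
import org.apache.spark.rdd.RDD

val inputLines: RDD[String] =
  productsRDD.map(p => s"${p.name} ${p.id} ${p.sku} ${p.datetime}")   // STEP 1-2: build "name id sku datetime"
val outputLines: RDD[String] = inputLines.pipe("/usr/local/bin/calc") // STEP 3: one external process per partition
// STEP 4: parse outputLines and save the results back to Cassandra.
{code}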

On Fri, Sep 23, 2016 at 12:14 AM, Jakob Odersky  wrote:

> Hi Shashikant,
>
> I think you are trying to do too much at once in your helper class.
> Spark's RDD API is functional, it is meant to be used by writing many
> little transformations that will be distributed across a cluster.
>
> Appart from that, `rdd.pipe` seems like a good approach. Here is the
> relevant doc comment (in RDD.scala) on how to use it:
>
>  Return an RDD created by piping elements to a forked external
> process. The resulting RDD
>* is computed by executing the given process once per partition. All
> elements
>* of each input partition are written to a process's stdin as lines
> of input separated
>* by a newline. The resulting partition consists of the process's
> stdout output, with
>* each line of stdout resulting in one element of the output
> partition. A process is invoked
>* even for empty partitions.
>*
>* [...]
> Check the full docs here
> http://spark.apache.org/docs/latest/api/scala/index.html#
> org.apache.spark.rdd.RDD@pipe(command:String):org.apache.
> spark.rdd.RDD[String]
>
> This is how you could use it:
>
> productRDD=//get from cassandra
> processedRDD=productsRDD.map(STEP1).map(STEP2).pipe(C binary of step
> 3)
> STEP4 //store processed RDD
>
> hope this gives you some pointers,
>
> best,
> --Jakob
>
>
>
>
> On Thu, Sep 22, 2016 at 2:10 AM, Shashikant Kulkarni (शशिकांत
> कुलकर्णी)  wrote:
> > Hello Jakob,
> >
> > Thanks for replying. Here is a short example of what I am trying. Taking
> an
> > example of Product column family in Cassandra just for explaining my
> > requirement
> >
> > In Driver.java
> > {
> >  JavaRDD productsRdd = Get Products from Cassandra;
> >  productsRdd.map(ProductHelper.processProduct());
> > }
> >
> > in ProductHelper.java
> > {
> >
> > public static Function processProduct() {
> > return new Function< Product, Boolean>(){
> > private static final long serialVersionUID = 1L;
> >
> > @Override
> > public Boolean call(Product product) throws Exception {
> > //STEP 1: Doing some processing on product object.
> > //STEP 2: Now using few values of product, I need to create a string like
> > "name id sku datetime"
> > //STEP 3: Pass this string to my C binary file to perform some complex
> > calculations and return some data
> > //STEP 4: Get the return data and store it back in Cassandra DB
> > }
> > };
> > }
> > }
> >
> > In this ProductHelper, I cannot pass and don't want to pass sparkContext
> > object as app will throw error of "task not serializable". If there is a
> way
> > let me know.
> >
> > Now I am not able to achieve STEP 3 above. How can I pass a String to C
> > binary and get the output back in my program. The C binary reads data
> from
> > STDIN and outputs data to STDOUT. It is working from other part of
> > application from PHP. I want to reuse the same C binary in my Apache
> SPARK
> > application for some background processing and analysis using
> JavaRDD.pipe()
> > API. If there is any other way let me know. This code will be executed in
> > all the nodes in a cluster.
> >
> > Hope my requirement is now clear. How to do this?
> >
> > Regards,
> > Shash
> >
> > On Thu, Sep 22, 2016 at 4:13 AM, Jakob Odersky 
> wrote:
> >>
> >> Can you provide more details? It's unclear what you're asking
> >>
> >> On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com
> >>  wrote:
> >> > Hi All,
> >> >
> >> > I am trying to use the JavaRDD.pipe() API.
> >> >
> >> > I have one object with me from the JavaRDD
> >
> >
>
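
A minimal sketch (in Scala) of the pipe() flow suggested above; the Product case
class, the keyspace/table names, the binary path and the output location are all
placeholders for the poster's own code, and the C binary is assumed to read one
record per line on stdin and print one result per line on stdout:

import com.datastax.spark.connector._   // assumes the DataStax spark-cassandra-connector is on the classpath
import org.apache.spark.rdd.RDD

// hypothetical row type; the fields only illustrate the "name id sku datetime" line described above
case class Product(id: String, name: String, sku: String, datetime: String)

// STEP 1/2: read the products and build one whitespace-separated input line per record
val productsRDD = sc.cassandraTable[Product]("my_keyspace", "product")
val inputLines: RDD[String] = productsRDD.map(p => s"${p.name} ${p.id} ${p.sku} ${p.datetime}")

// STEP 3: fork the external binary once per partition; every input line is written to its
// stdin, and every line it prints on stdout becomes one element of the result RDD
val results: RDD[String] = inputLines.pipe("/opt/myapp/bin/calc")

// STEP 4: persist the results (the poster would instead parse these lines and write them back to Cassandra)
results.saveAsTextFile("hdfs:///tmp/product_results")

Note that the binary has to be present at the same path on every worker node (or be
shipped with --files), since pipe() launches it wherever the partitions are processed.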


UDAF collect_list: Hive Query or spark sql expression

2016-09-23 Thread Jason Mop
Hi Spark Team,

I see that most Hive functions have been implemented as native Spark SQL expressions,
but collect_list is still backed by the Hive implementation. Will it also be implemented
as a native expression in the future? Any update?

Cheers,
Ming


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Steve Loughran

On 23 Sep 2016, at 08:33, ABHISHEK wrote:

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: 
hdfs:/abc.com:8020/user/abhietc/abc.drl
 (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at 
org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at 
org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)


Looks like this KieFileSystemImpl class only works with local files, so when
it gets an HDFS path it tries to open it as a local file and gets confused.

You may need to write to a local FS temp file and then copy it into HDFS.
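
A minimal sketch of that suggestion using the Hadoop FileSystem API (both paths
below are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// write the rules file to a local temp location first, then push that copy into HDFS
val localTmp   = new Path("file:///tmp/abc.drl")
val hdfsTarget = new Path("hdfs://abc.com:8020/user/abhietc/abc.drl")

val fs = FileSystem.get(hdfsTarget.toUri, new Configuration())
fs.copyFromLocalFile(false, true, localTmp, hdfsTarget)   // delSrc = false, overwrite = true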


Re: Spark Yarn Cluster with Reference File

2016-09-23 Thread Aditya

Hi Abhishek,

From your spark-submit it seems you are passing the file as a parameter to
the driver program, so it depends on what exactly you are doing with
that parameter. With the --files option the file is shipped to all the
worker nodes, but if your code still references it by the original
absolute path, it won't be found on the workers in distributed mode.


If you can share the snippet of code it will be easy to debug.
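
A minimal sketch of one way to pick up a file shipped with --files from inside the
job: SparkFiles resolves the node-local copy, and that local path (rather than the
hdfs:// URL) is what should be handed to a library that only understands local files.

import org.apache.spark.SparkFiles

// submitted with:  spark-submit ... --files /home/abhietc/abc/abc.drl ...
// --files places abc.drl in the container's working directory on every node;
// SparkFiles.get returns the absolute local path of that copy
val localDrlPath = SparkFiles.get("abc.drl")

// then pass localDrlPath wherever the rules file path is needed,
// e.g. KieSessionFactory.execute(rows, localDrlPath) in the code from the stack trace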

On Friday 23 September 2016 01:03 PM, ABHISHEK wrote:

Hello there,

I have a Spark application which refers to an external file ‘abc.drl’
containing unstructured data.
The application is able to find this reference file if I run the app in local
mode, but in YARN cluster mode it is not able to find the file
at the specified path.
I tried both a local and an HDFS path with the --files option, but it
didn’t work.



What is working?
1. The current Spark application runs fine if I run it in local mode as
mentioned below.

In the command below, the file path is a local path, not HDFS.
spark-submit --master local[*] --class "com.abc.StartMain"
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl


3. I want to run this Spark application on YARN in cluster mode.
For that, I used the commands below, but the application is not able to find the
path of the reference file abc.drl. I tried giving both a local and an HDFS
path, but it didn’t work.


spark-submit --master yarn --deploy-mode cluster  --files 
/home/abhietc/abc/abc.drl --class com.abc.StartMain 
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl


spark-submit --master yarn --deploy-mode cluster  --files 
hdfs://abhietc.com:8020/user/abhietc/abc.drl 
 --class 
com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
hdfs://abhietc.com:8020/user/abhietc/abc.drl 



spark-submit --master yarn --deploy-mode cluster  --files 
hdfs://abc.com:8020/tmp/abc.drl  
--class com.abc.StartMain abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
hdfs://abc.com:8020/tmp/abc.drl 



Error messages:
Surprisingly, we are not doing any write operation on the reference file, but
the log still shows that the application is trying to write to the file instead of
reading it.

The log also shows a FileNotFoundException for both the HDFS and the local path.
-
16/09/20 14:49:50 ERROR scheduler.JobScheduler: Error running job 
streaming job 1474363176000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 1.0 (TID 4, abc.com ): 
java.lang.RuntimeException: Unable to write Resource: 
FileResource[file=hdfs:/abc.com:8020/user/abhietc/abc.drl 
]
at 
org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:71)
at 
com.hmrc.taxcalculator.KieSessionFactory$.getNewSession(KieSessionFactory.scala:49)
at 
com.hmrc.taxcalculator.KieSessionFactory$.getKieSession(KieSessionFactory.scala:21)
at 
com.hmrc.taxcalculator.KieSessionFactory$.execute(KieSessionFactory.scala:27)
at 
com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
at 
com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: 
hdfs:/abc.com:8020/user/abhietc/abc.drl 
 (No such file or directory)

at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at 
org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at 

Re: How to specify file

2016-09-23 Thread Mich Talebzadeh
You can do the following with option("delimiter") ..


val df = spark.read.option("header",
false).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv")

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 23 September 2016 at 07:56, Sea <261810...@qq.com> wrote:

> Hi, I want to run sql directly on files, I find that spark has supported
> sql like select * from csv.`/path/to/file`, but files may not be split by
> ','. Maybe it is split by '\001', how can I specify delimiter?
>
> Thank you!
>
>
>


Spark Yarn Cluster with Reference File

2016-09-23 Thread ABHISHEK
Hello there,

I have a Spark application which refers to an external file ‘abc.drl’
containing unstructured data.
The application is able to find this reference file if I run the app in local
mode, but in YARN cluster mode it is not able to find the file at the
specified path.
I tried both a local and an HDFS path with the --files option, but it didn’t
work.


What is working?
1. The current Spark application runs fine if I run it in local mode as
mentioned below.
In the command below, the file path is a local path, not HDFS.
spark-submit --master local[*]  --class "com.abc.StartMain"
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl

3. I want to run this Spark application on YARN in cluster mode.
For that, I used the commands below, but the application is not able to find the path
of the reference file abc.drl. I tried giving both a local and an HDFS path, but it
didn’t work.

spark-submit --master yarn --deploy-mode cluster  --files
/home/abhietc/abc/abc.drl --class com.abc.StartMain
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar /home/abhietc/abc/abc.drl

spark-submit --master yarn --deploy-mode cluster  --files hdfs://
abhietc.com:8020/user/abhietc/abc.drl --class com.abc.StartMain
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar hdfs://
abhietc.com:8020/user/abhietc/abc.drl

spark-submit --master yarn --deploy-mode cluster  --files hdfs://
abc.com:8020/tmp/abc.drl --class com.abc.StartMain
abc-0.0.1-SNAPSHOT-jar-with-dependencies.jar hdfs://abc.com:8020/tmp/abc.drl


Error messages:
Surprisingly, we are not doing any write operation on the reference file, but the
log still shows that the application is trying to write to the file instead of reading
it.
The log also shows a FileNotFoundException for both the HDFS and the local path.
-
16/09/20 14:49:50 ERROR scheduler.JobScheduler: Error running job streaming
job 1474363176000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage
1.0 (TID 4, abc.com): java.lang.RuntimeException: Unable to write Resource:
FileResource[file=hdfs:/abc.com:8020/user/abhietc/abc.drl]
at
org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:71)
at
com.hmrc.taxcalculator.KieSessionFactory$.getNewSession(KieSessionFactory.scala:49)
at
com.hmrc.taxcalculator.KieSessionFactory$.getKieSession(KieSessionFactory.scala:21)
at
com.hmrc.taxcalculator.KieSessionFactory$.execute(KieSessionFactory.scala:27)
at
com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
at
com.abc.StartMain$$anonfun$main$1$$anonfun$4.apply(TaxCalculatorMain.scala:124)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: hdfs:/
abc.com:8020/user/abhietc/abc.drl (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at
org.drools.core.io.impl.FileSystemResource.getInputStream(FileSystemResource.java:123)
at
org.drools.compiler.kie.builder.impl.KieFileSystemImpl.write(KieFileSystemImpl.java:58)
... 19 more
--
Cheers,
Abhishek


Re: How to specify file

2016-09-23 Thread Sea
Hi, Hemant, Aditya:
I don't want to create a temp table and write code; I just want to run SQL
directly on files: "select * from csv.`/path/to/file`".
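
As far as I know the csv.`/path/to/file` shorthand does not take options, so a
minimal sketch of the closest alternative is to register the file as a temporary
view once and then query it with plain SQL (view name and path are placeholders):

// read with the built-in CSV source using \u0001 as the separator
val df = spark.read
  .option("sep", "\u0001")      // "delimiter" is accepted as an alias for this option as well
  .option("header", "false")
  .csv("/path/to/file")

df.createOrReplaceTempView("my_file")
spark.sql("select * from my_file").show()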





------ Original message ------
From: "Hemant Bhanawat";
Date: 2016-09-23 (Fri) 3:32
To: "Sea" <261810...@qq.com>;
Cc: "user";
Subject: Re: How to specify file




Check out the README on the following page. This is the csv connector that you
are using. I think you need to specify the delimiter option.

https://github.com/databricks/spark-csv

Hemant Bhanawat

www.snappydata.io 







 
On Fri, Sep 23, 2016 at 12:26 PM, Sea <261810...@qq.com> wrote:
Hi, I want to run sql directly on files, I find that spark has supported sql 
like select * from csv.`/path/to/file`, but files may not be split by ','. 
Maybe it is split by '\001', how can I specify delimiter?

Thank you!

Re: How to specify file

2016-09-23 Thread Aditya

Hi Sea,

For using Spark SQL you will need to create a DataFrame from the file and
then execute select * on the DataFrame.

In your case you will need to do something like this:

JavaRDD<String> DF = context.textFile("path");
JavaRDD<Row> rowRDD3 = DF.map(new Function<String, Row>() {
    public Row call(String record) throws Exception {
        String[] fields = record.split("\001");   // split on the \001 delimiter
        return createRow(fields);                 // createRow: your own helper that builds a Row from the fields
    }
});
DataFrame ResultDf3 = hiveContext.createDataFrame(rowRDD3, schema);
ResultDf3.registerTempTable("test");
hiveContext.sql("select * from test");

You will need to create the schema for the file first, just like you have
done for the csv file.





On Friday 23 September 2016 12:26 PM, Sea wrote:
Hi, I want to run sql directly on files, I find that spark has 
supported sql like select * from csv.`/path/to/file`, but files may 
not be split by ','. Maybe it is split by '\001', how can I specify 
delimiter?


Thank you!








Re: How to specify file

2016-09-23 Thread Hemant Bhanawat
Check out the README on the following page. This is the csv connector that
you are using. I think you need to specify the delimiter option.

https://github.com/databricks/spark-csv

Hemant Bhanawat 
www.snappydata.io

On Fri, Sep 23, 2016 at 12:26 PM, Sea <261810...@qq.com> wrote:

> Hi, I want to run sql directly on files, I find that spark has supported
> sql like select * from csv.`/path/to/file`, but files may not be split by
> ','. Maybe it is split by '\001', how can I specify delimiter?
>
> Thank you!
>
>
>


How to specify file

2016-09-23 Thread Sea
Hi, I want to run sql directly on files, I find that spark has supported sql 
like select * from csv.`/path/to/file`, but files may not be split by ','. 
Maybe it is split by '\001', how can I specify delimiter?

Thank you!

Re: Spark RDD and Memory

2016-09-23 Thread Aditya

Hi Datta,

Thanks for the reply.

If I haven't cached any RDD and the data being loaded into memory
after performing some operations exceeds the available memory, how is it handled
by Spark?
Are previously loaded RDDs removed from memory to free it for
subsequent steps in the DAG?


I am running into an issue where my DAG is very long, all the data
does not fit into memory, and at some point all my executors get lost.





On Friday 23 September 2016 12:02 PM, Datta Khot wrote:

Hi Aditya,

If you cache the RDDs - like textFile.cache(), 
textFile1().cache() - then it will not load the data again from file 
system.


Once done with related operations it is recommended to uncache the 
RDDs to manage memory efficiently and avoid it's exhaustion.


Note caching operation is with main memory and persist is to disk.

Datta
https://in.linkedin.com/in/datta-khot-240b544
http://www.datasherpa.io/

On Fri, Sep 23, 2016 at 10:23 AM, Aditya 
> wrote:


Thanks for the reply.

One more question.
How spark handles data if it does not fit in memory? The answer
which I got is that it flushes the data to disk and handle the
memory issue.
Plus in below example.
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")
val join = textFile.join(textFile1)
join.saveAsTextFile("/home/output")
val count = join.count()

When the first action is performed it loads textFile and
textFile1 in memory, performes join and save the result.
But when the second action (count) is called, it again loads
textFile and textFile1 in memory and again performs the join
operation?
If it loads again what is the correct way to prevent it from
loading again again the same data?


On Thursday 22 September 2016 11:12 PM, Mich Talebzadeh wrote:

Hi,

unpersist works on storage memory not execution memory. So I do
not think you can flush it out of memory if you have not cached
it using cache or something like below in the first place.

s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

s.unpersist

I believe the recent versions of Spark deploy Least Recently
Used (LRU) mechanism to flush unused data out of memory much
like RBMS cache management. I know LLDAP does that.

HTH



Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk.Any and all responsibility
for any loss, damage or destruction of data or any other
property which may arise from relying on this
email's technical content is explicitly disclaimed. The author
will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


On 22 September 2016 at 18:09, Hanumath Rao Maduri
 wrote:

Hello Aditya,

After an intermediate action has been applied you might want
to call rdd.unpersist() to let spark know that this rdd is
no longer required.

Thanks,
-Hanu

On Thu, Sep 22, 2016 at 7:54 AM, Aditya
> wrote:

Hi,

Suppose I have two RDDs
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")

Later I perform a join operation on above two RDDs
val join = textFile.join(textFile1)

And there are subsequent transformations without
including textFile and textFile1 further and an action
to start the execution.

When action is called, textFile and textFile1 will be
loaded in memory first. Later join will be performed and
kept in memory.
My question is once join is there memory and is used for
subsequent execution, what happens to textFile and
textFile1 RDDs. Are they still kept in memory untill the
full lineage graph is completed or is it destroyed once
its use is over? If it is kept in memory, is there any
way I can explicitly remove it from memory to free the
memory?







Re: Spark RDD and Memory

2016-09-23 Thread Datta Khot
Hi Aditya,

If you cache the RDDs - like textFile.cache(), textFile1.cache() - then
it will not load the data again from the file system.

Once you are done with the related operations, it is recommended to uncache the RDDs to
manage memory efficiently and avoid its exhaustion.

Note caching operation is with main memory and persist is to disk.

Datta
https://in.linkedin.com/in/datta-khot-240b544
http://www.datasherpa.io/
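
A minimal sketch of the cache/uncache pattern being described, reusing the paths from
the question; keyBy with a comma split is only added here (as an assumption about the
file layout) so that join() type-checks on pair RDDs:

// key each line on its first comma-separated field so the two RDDs can be joined
val textFile  = sc.textFile("/user/emp.txt").keyBy(line => line.split(",")(0))
val textFile1 = sc.textFile("/user/emp1.xt").keyBy(line => line.split(",")(0))

// mark the inputs and the join result for caching before the first action
textFile.cache()
textFile1.cache()
val join = textFile.join(textFile1)
join.cache()

join.saveAsTextFile("/home/output")   // first action: reads the files, materialises and caches the join
val count = join.count()              // second action: served from the cached data, no re-read of the inputs

// release the memory once the RDDs are no longer needed
textFile.unpersist()
textFile1.unpersist()
join.unpersist()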

On Fri, Sep 23, 2016 at 10:23 AM, Aditya  wrote:

> Thanks for the reply.
>
> One more question.
> How spark handles data if it does not fit in memory? The answer which I
> got is that it flushes the data to disk and handle the memory issue.
> Plus in below example.
> val textFile = sc.textFile("/user/emp.txt")
> val textFile1 = sc.textFile("/user/emp1.xt")
> val join = textFile.join(textFile1)
> join.saveAsTextFile("/home/output")
> val count = join.count()
>
> When the first action is performed it loads textFile and textFile1 in
> memory, performes join and save the result.
> But when the second action (count) is called, it again loads textFile and
> textFile1 in memory and again performs the join operation?
> If it loads again what is the correct way to prevent it from loading again
> again the same data?
>
>
> On Thursday 22 September 2016 11:12 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> unpersist works on storage memory not execution memory. So I do not think
> you can flush it out of memory if you have not cached it using cache or
> something like below in the first place.
>
> s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
>
> s.unpersist
>
> I believe the recent versions of Spark deploy Least Recently Used
> (LRU) mechanism to flush unused data out of memory much like RBMS cache
> management. I know LLDAP does that.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 22 September 2016 at 18:09, Hanumath Rao Maduri 
> wrote:
>
>> Hello Aditya,
>>
>> After an intermediate action has been applied you might want to call
>> rdd.unpersist() to let spark know that this rdd is no longer required.
>>
>> Thanks,
>> -Hanu
>>
>> On Thu, Sep 22, 2016 at 7:54 AM, Aditya > co.in> wrote:
>>
>>> Hi,
>>>
>>> Suppose I have two RDDs
>>> val textFile = sc.textFile("/user/emp.txt")
>>> val textFile1 = sc.textFile("/user/emp1.xt")
>>>
>>> Later I perform a join operation on above two RDDs
>>> val join = textFile.join(textFile1)
>>>
>>> And there are subsequent transformations without including textFile and
>>> textFile1 further and an action to start the execution.
>>>
>>> When action is called, textFile and textFile1 will be loaded in memory
>>> first. Later join will be performed and kept in memory.
>>> My question is once join is there memory and is used for subsequent
>>> execution, what happens to textFile and textFile1 RDDs. Are they still kept
>>> in memory untill the full lineage graph is completed or is it destroyed
>>> once its use is over? If it is kept in memory, is there any way I can
>>> explicitly remove it from memory to free the memory?
>>>
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
>


Re: Redshift Vs Spark SQL (Thrift)

2016-09-23 Thread ayan guha
Thanks, but here is my argument for why they may not serve different
purposes: I am thinking of both Redshift and Hive as data warehousing
solutions, with STS as a mechanism to lift Hive's performance (if Tez or
LLAP can provide similar performance, I am fine with using Hive Thrift Server
as well).

So, I want to build and implement the core DWH models (read: star schema) on
Hive and expose them through some non-MapReduce framework.

Does that make sense? Or am I thinking along the wrong path, and Redshift can do
something more?

On Fri, Sep 23, 2016 at 4:18 PM, Jörn Franke  wrote:

> Depends what your use case is. A generic benchmark does not make sense,
> because they are different technologies for different purposes.
>
> On 23 Sep 2016, at 06:09, ayan guha  wrote:
>
> Hi
>
> Is there any benchmark or point of view in terms of pros and cons between
> AWS Redshift vs Spark SQL through STS?
>
> --
> Best Regards,
> Ayan Guha
>
>


-- 
Best Regards,
Ayan Guha


Re: Redshift Vs Spark SQL (Thrift)

2016-09-23 Thread Jörn Franke
Depends what your use case is. A generic benchmark does not make sense, because 
they are different technologies for different purposes.

> On 23 Sep 2016, at 06:09, ayan guha  wrote:
> 
> Hi
> 
> Is there any benchmark or point of view in terms of pros and cons between AWS 
> Redshift vs Spark SQL through STS?
> 
> -- 
> Best Regards,
> Ayan Guha