Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread JF Chen
Thanks, ayan. I have also tried this method; the trickiest part is that the DataFrame union method requires both sides to have the same schema, while the schema of my files is variable. Regards, Junfeng Chen
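[The thread does not resolve this. One commonly used way around the identical-schema requirement, offered here as a hedged sketch rather than anything from the thread, is to align every DataFrame to the superset of all column names before unioning, filling missing columns with nulls. The function names are illustrative.]

```python
from functools import reduce
from pyspark.sql import functions as F

def align_to(df, all_cols):
    """Add any missing columns as nulls so every frame shares one schema."""
    for c in set(all_cols) - set(df.columns):
        df = df.withColumn(c, F.lit(None))   # add .cast(...) if exact types matter
    return df.select(*all_cols)              # same column order everywhere

def union_variable_schemas(dfs):
    # Superset of every column name seen across the inputs.
    all_cols = sorted({c for df in dfs for c in df.columns})
    return reduce(lambda a, b: a.union(b), [align_to(df, all_cols) for df in dfs])
```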

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread Thakrar, Jayesh
Junfeng, I would suggest preprocessing/validating the paths in plain Python (not Spark) before you try to fetch data. I am not familiar with Python Hadoop libraries, but see if this helps: http://crs4.github.io/pydoop/tutorial/hdfs_api.html Best, Jayesh
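[For reference, a minimal sketch of the kind of check the linked pydoop tutorial describes, assuming pydoop is installed and that its `hdfs.path.exists` mirrors `os.path.exists`; the paths are illustrative.]

```python
import pydoop.hdfs as hdfs

# Illustrative input paths; keep only those that actually exist.
paths = ["hdfs:///data/2018/05/21", "hdfs:///data/2018/05/22"]
existing = [p for p in paths if hdfs.path.exists(p)]
```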

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread JF Chen
Thanks, Thakrar! Regards, Junfeng Chen

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread ayan guha
A relatively naive solution would be:
0. Create a dummy blank DataFrame.
1. Loop through the list of paths.
2. Try to create a DataFrame from each path; on success, union it cumulatively.
3. On error, just ignore it or handle it as you wish.
At the end of the loop, just use the unioned DataFrame (see the sketch below).
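[A minimal PySpark sketch of this loop; paths and format are illustrative, and a `None` seed stands in for the dummy blank DataFrame so no schema has to be declared up front.]

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
paths = ["/data/a", "/data/b", "/data/c"]   # illustrative paths

result = None                                # seed instead of a dummy blank frame
for p in paths:
    try:
        df = spark.read.json(p)              # step 2: whatever format you read
    except AnalysisException:                # step 3: e.g. "Path does not exist"
        continue
    result = df if result is None else result.union(df)
# result is None if nothing was readable; otherwise it is the cumulative union
```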

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread JF Chen
Thanks, Thakrar. I have tried to check the existence of each path before reading it, but the HDFSCli Python package does not seem to support wildcards. "FileSystem.globStatus" is a Java API, while I am using Python via Livy. Do you know of any Python API implementing the same function? Regards, Junfeng Chen
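[The thread does not settle on an answer. One possibility, sketched here as an assumption, is to reach the Java `FileSystem.globStatus` method through PySpark's py4j gateway; note that `_jvm` and `_jsc` are private accessors, not a stable public API, and the glob pattern is illustrative. An existing `spark` session is assumed.]

```python
# Reach Hadoop's FileSystem.globStatus through PySpark's JVM gateway.
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path

pattern = Path("/data/2018/05/*/part-*")    # illustrative glob pattern
fs = pattern.getFileSystem(conf)
statuses = fs.globStatus(pattern)           # null/empty when nothing matches

existing = [s.getPath().toString() for s in (statuses or [])]
```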

Spark Worker Re-register to Master

2018-05-21 Thread sushil.chaudhary
All, We have a problem with the Spark Worker. The worker goes down whenever we are unable to get the Spark master up and running before starting the worker. Of course, it does try to ReregisterWithMaster, up to 16 attempts: 1. The first 6 attempts are made at intervals of approximately 10 seconds. 2. The next 10

Re: Spark horizontal scaling is not supported in which cluster mode? Ask

2018-05-21 Thread Mark Hamstra
Horizontal scaling is scaling across multiple, distributed computers (or at least OS instances). Local mode is, therefore, by definition not horizontally scalable since it just uses a configurable number of local threads. If the question actually asked "which cluster manager...?", then I have a
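[To make the distinction concrete, a minimal illustration of the master URLs involved; host names are placeholders.]

```python
from pyspark.sql import SparkSession

# Local mode: a single JVM with 8 worker threads -- no horizontal scaling.
local_spark = SparkSession.builder.master("local[8]").getOrCreate()

# Cluster managers schedule executors across machines -- horizontal scaling:
#   .master("spark://master-host:7077")   Standalone
#   .master("yarn")                       YARN
#   .master("mesos://master-host:5050")   Mesos
```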

Spark horizontal scaling is not supported in which cluster mode? Ask

2018-05-21 Thread unk1102
Hi, I came across a Spark question asking which Spark cluster manager does not support horizontal scalability. The answer options were Mesos, YARN, Standalone, and local mode. I believe all cluster managers are horizontally scalable; please correct me if I am wrong. I think the answer is local mode.

Re: testing frameworks

2018-05-21 Thread Holden Karau
So I’m biased as the author of spark-testing-base, but I think it’s pretty OK. Are you looking for unit tests, integration tests, or something else?
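[spark-testing-base provides base classes for exactly this; as a rough illustration of the unit-test shape, here is a plain pytest sketch rather than spark-testing-base's own API, with illustrative names throughout.]

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A small local session shared across the test run, as CI would use it.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("unit-tests")
               .getOrCreate())
    yield session
    session.stop()

def test_distinct_drops_duplicates(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "v"])
    assert df.distinct().count() == 2
```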

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Russell Spitzer
The answer is most likely that when you use cross Java-Python code, you incur a penalty for every object that you transform from a Java object into a Python object (and then back again into a Java object) when data is being passed in and out of your functions. A way around this would probably be
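[The message is cut off at this point. A commonly suggested workaround, offered here as an assumption rather than the actual continuation, is to stay inside the DataFrame API: built-in expressions run in the JVM and avoid the per-row Python round trip. A DataFrame `df` with a string column `name` is assumed.]

```python
from pyspark.sql import functions as F

# Python UDF: every row makes the Java -> Python -> Java round trip.
# upper_udf = F.udf(lambda s: s.upper())
# slow = df.withColumn("upper_name", upper_udf("name"))

# Built-in expression: runs entirely in the JVM, no per-row round trip.
fast = df.withColumn("upper_name", F.upper(F.col("name")))
```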

Re: How to skip nonexistent file when read files with spark?

2018-05-21 Thread Thakrar, Jayesh
You could probably do some preprocessing/checking of the paths before you attempt to read them via Spark. Whether it is a local or HDFS filesystem, you can check for existence and other details by using the "FileSystem.globStatus" method from the Hadoop API.

help in copying data from one azure subscription to another azure subscription

2018-05-21 Thread amit kumar singh
Hi Team, We are trying to move data from one Azure subscription to another Azure subscription. Is there a faster way to do this through Spark? I am using distcp and it is taking forever. Thanks, Rohit
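[The archive does not include a reply. One Spark-based alternative, sketched here as an assumption, is to read from the source storage account and write to the destination in a single job; the account names, container names, and keys are placeholders, and the hadoop-azure/WASB connector is assumed to be on the classpath.]

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Credentials for both storage accounts (placeholder names and keys).
hconf.set("fs.azure.account.key.srcaccount.blob.core.windows.net", "<src-key>")
hconf.set("fs.azure.account.key.dstaccount.blob.core.windows.net", "<dst-key>")

# Read from one subscription's account, write to the other's.
df = spark.read.parquet("wasbs://data@srcaccount.blob.core.windows.net/input/")
df.write.parquet("wasbs://data@dstaccount.blob.core.windows.net/output/")
```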

testing frameworks

2018-05-21 Thread Steve Pruitt
Hi, Can anyone recommend testing frameworks suitable for Spark jobs? Something that can be integrated into a CI tool would be great. Thanks.

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-21 Thread Alonso Isidoro Roman
Spark is developed mainly in Scala, so all new features come first to Scala, then Java, and finally Python. I'm not surprised by the results; we've seen this at Stratio since the first versions of Spark. At the beginning of development, some of our engineers make the prototype with

Adding jars

2018-05-21 Thread Malveeka Bhandari
Hi. Can I add jars to the Spark executor classpath in a running context? Basically, if I have a running Spark session and I edit spark.jars in the middle of the code, will it pick up the changes? If not, is there any way to add new dependent jars to a running Spark context? We’re using Livy
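[The archive does not include a reply. For what it's worth, the JVM-side `addJar` can ship a jar to executors at runtime; this is a hedged sketch with an illustrative path, it goes through the private `_jsc` accessor, and the usual caveat applies that it does not change the driver's classpath, while `spark.jars` is only read when the context is created.]

```python
# JavaSparkContext.addJar ships the jar to executors for task execution.
# _jsc is a private accessor, not a stable public API.
spark.sparkContext._jsc.addJar("hdfs:///libs/my-dependency.jar")  # illustrative path

# By contrast, editing spark.jars on an already-running session has no effect,
# since that setting is consulted only at context startup.
```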

Executors slow down when running on the same node

2018-05-21 Thread Javier Pareja
Hello, I have a Spark Streaming job reading data from Kafka, processing it, and inserting it into Cassandra. The job is running on a cluster with 3 machines. I use Mesos to submit the job with 3 executors using 1 core each. The problem is that when all executors are running on the same node, the

Re: is it possible to create one KafkaDirectStream (Dstream) per topic?

2018-05-21 Thread Alonso Isidoro Roman
Check this thread. On Mon., May 21, 2018 at 0:25, kant kodali () wrote: > Hi All, > > I have 5 Kafka topics and I am wondering if it is even possible to create one > KafkaDirectStream
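[The linked thread is not preserved in this digest. As a sketch of what one direct stream per topic might look like with the Kafka 0.8 Python API of that era; the broker address and topic names are placeholders.]

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="per-topic-streams")
ssc = StreamingContext(sc, batchDuration=10)
kafka_params = {"metadata.broker.list": "broker1:9092"}   # placeholder broker

topics = ["topic_a", "topic_b", "topic_c", "topic_d", "topic_e"]
# One direct stream per topic, so each can be processed independently.
streams = {t: KafkaUtils.createDirectStream(ssc, [t], kafka_params) for t in topics}

for topic, stream in streams.items():
    stream.map(lambda kv: kv[1]).pprint()   # records arrive as (key, value) pairs

ssc.start()
ssc.awaitTermination()
```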