how to get variable type signature through the api

2017-04-21 Thread Matthew Purdy
When running Spark from spark-shell, each time a variable is defined the shell prints out the type signature of that variable along with the toString of the instance. How can I programmatically generate the same signature without using the shell (for debugging purposes) from a Spark script or
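
A minimal sketch of one way to do this with Scala runtime reflection (the describe helper and the sample value are illustrative assumptions, not something the shell exposes; the variable name must be passed in explicitly because ordinary code cannot recover it):

    import scala.reflect.runtime.universe._

    // Mimics the shell's "name: Type = value" line for a value you pass in.
    def describe[T: TypeTag](name: String, value: T): String =
      s"$name: ${typeOf[T]} = ${value.toString}"

    val counts = Seq(("a", 1), ("b", 2))
    println(describe("counts", counts))
    // counts: Seq[(String, Int)] = List((a,1), (b,2))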

Re: splitting a huge file

2017-04-21 Thread Roger Marin
If the file is in HDFS already, you can use Spark to read the file using a specific input format (depending on the file type) to split it. http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html On Sat, Apr 22, 2017 at 4:36 AM, Paul Tremblay wrote:
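
A sketch of reading an HDFS file through an explicit InputFormat (paths are placeholders; TextInputFormat is just one example of the linked mapred API):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("input-format-split").getOrCreate()
    val sc = spark.sparkContext

    // Each HDFS block becomes one input split, i.e. one partition of the resulting RDD.
    val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/huge_file.txt")
    println(s"Input splits: ${rdd.getNumPartitions}")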

Re: question regarding pyspark

2017-04-21 Thread Pushkar.Gujar
Hi Afshin, If you need to associate the header information from the 2nd file with the first one, you can do that by specifying a custom schema. Below is an example from the spark-csv package. As you can guess, you will have to do some pre-processing to create customSchema by first reading the second file. val
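
A sketch of that pre-processing step, assuming the header file is a single comma-separated line and all columns can be treated as strings (file paths and the delimiter are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder.appName("csv-with-external-header").getOrCreate()

    // Read the header-only file and build a schema from it.
    val headerLine = spark.sparkContext.textFile("hdfs:///data/headers.csv").first()
    val customSchema = StructType(
      headerLine.split(",").map(name => StructField(name.trim, StringType, nullable = true)))

    // Apply the schema to the header-less data file.
    val df = spark.read
      .schema(customSchema)
      .option("header", "false")
      .csv("hdfs:///data/data_without_header.csv")

    df.printSchema()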

question regarding pyspark

2017-04-21 Thread Afshin, Bardia
I'm ingesting a CSV with hundreds of columns, and the original CSV file itself doesn't have any header. I do have a separate file that is just the headers; is there a way to tell the Spark API this information when loading the CSV file? Or do I have to do some preprocessing before doing so?

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Gene Pang
Hi Georg, Yes, that should be possible with Alluxio. Tachyon was renamed to Alluxio. This article on how Alluxio is used for a Spark streaming use case may be helpful. Thanks, Gene On Fri, Apr
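
A sketch of the pattern being described, assuming the Alluxio client is on the classpath and the master host, port and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("alluxio-refresh").getOrCreate()

    // Write the periodically updated static data to Alluxio so it can be re-read cheaply.
    val staticDf = spark.read.parquet("hdfs:///warehouse/reference_data")
    staticDf.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/cache/reference_data")

    // Later, e.g. from the streaming job, read the refreshed copy back.
    val refreshed = spark.read.parquet("alluxio://alluxio-master:19998/cache/reference_data")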

What is correct behavior for spark.task.maxFailures?

2017-04-21 Thread Chawla,Sumit
I am seeing a strange issue. I had a badly behaving slave that failed the entire job. I have set spark.task.maxFailures to 8 for my job. It seems like all task retries happen on the same slave in case of failure. My expectation was that the task would be retried on a different slave in case of failure, and
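
A hedged configuration sketch (Spark 2.1+): spark.task.maxFailures only caps the total attempts per task; the blacklist settings are what push retries away from a misbehaving executor or node. The specific thresholds below are illustrative, not recommendations.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("maxFailures-example")
      .config("spark.task.maxFailures", "8")
      // Blacklist a failing executor/node for a task instead of retrying on it.
      .config("spark.blacklist.enabled", "true")
      .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
      .getOrCreate()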

Re: splitting a huge file

2017-04-21 Thread Jörn Franke
What is your DWH technology? If the file is on HDFS then, depending on the format, Spark can read parts of it in parallel. > On 21. Apr 2017, at 20:36, Paul Tremblay wrote: > > We are tasked with loading a big file (possibly 2TB) into a data warehouse. > In order to

Re: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ?

2017-04-21 Thread Felix Cheung
Not currently - how are you planning to use the output from word2vec? From: Radhwane Chebaane Sent: Thursday, April 20, 2017 4:30:14 AM To: user@spark.apache.org Subject: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ? Hi,
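
For reference, a sketch of what the equivalent looks like in the Scala ML API (word2vec is available there even though SparkR does not expose it in this version); the toy corpus and parameter values are illustrative:

    import org.apache.spark.ml.feature.Word2Vec
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("word2vec-example").getOrCreate()

    // One array-of-words column named "text".
    val docs = spark.createDataFrame(Seq(
      "spark streaming is fast".split(" "),
      "sparkr wraps spark in r".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("vector")
      .setVectorSize(50)
      .setMinCount(1)
      .fit(docs)

    model.transform(docs).show(truncate = false)
    model.findSynonyms("spark", 2).show()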

splitting a huge file

2017-04-21 Thread Paul Tremblay
We are tasked with loading a big file (possibly 2TB) into a data warehouse. In order to do this efficiently, we need to split the file into smaller files. I don't believe there is a way to do this with Spark, because in order for Spark to distribute the file to the worker nodes, it first has to
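
For comparison, a sketch of the approach suggested elsewhere in this thread, assuming the 2TB file is plain text and already on HDFS: Spark splits it by HDFS block on read, and repartition controls how many output files are written (paths and the partition count are placeholders).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("split-huge-file").getOrCreate()

    val lines = spark.read.text("hdfs:///landing/huge_file.txt")
    lines.repartition(2000)
      .write
      .mode("overwrite")
      .text("hdfs:///landing/huge_file_split/")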

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Georg Heiler
You could write your views to Hive or maybe Tachyon. Is the periodically updated data big? Hemanth Gudela wrote on Fri., 21 Apr. 2017 at 16:55: > Being new to Spark, I think I need your suggestion again. > > > > #2 you can always define a batch DataFrame and

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Hemanth Gudela
Being new to Spark, I think I need your suggestion again. #2 you can always define a batch DataFrame and register it as a view, and then run a background thread that periodically creates a new DataFrame with updated data and re-registers it as a view with the same name. I seem to have misunderstood your
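
A sketch of the "re-register the view with the same name" idea from #2; the source path, view name and refresh interval are assumptions:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("refresh-static-view").getOrCreate()

    def loadStatic() = spark.read.parquet("hdfs:///reference/static_data")

    // Register the initial view.
    loadStatic().createOrReplaceTempView("static_data")

    // Background thread that reloads the data and replaces the view every 10 minutes.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        loadStatic().createOrReplaceTempView("static_data")
    }, 10, 10, TimeUnit.MINUTES)

    // Code that looks up "static_data" by name after a refresh will see the new data.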

Re: Azure Event Hub with Pyspark

2017-04-21 Thread ayan guha
Hi, I need either all Scala or all Python (preferably Python). I will check the one suggested by Denny, thanks. On Fri, Apr 21, 2017 at 3:13 PM, Denny Lee wrote: > As well, perhaps another option could be to use the Spark Connector to > DocumentDB