Re: removing header from csv file

2016-04-26 Thread nihed mbarek
You can add a filter with a string that you are sure is available only in the header. On Wednesday, 27 April 2016, Divya Gehlot wrote: > yes you can remove the headers by removing the first row > > can first() or head() to do that > > > Thanks, > Divya > > On 27 April 2016
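A minimal Scala sketch of that filter approach, assuming the file was read with sc.textFile and the header starts with a hypothetical column name "id" (path and header prefix are placeholders):

// drop any line that looks like the header
val rdd = sc.textFile("/path/to/file.csv")
val noHeader = rdd.filter(line => !line.startsWith("id,"))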

Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the header by removing the first row; you can use first() or head() to do that. Thanks, Divya On 27 April 2016 at 13:24, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header > and processing of csv files. But it
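For reference, a short spark-shell style sketch of the first()-based variant (no SQLContext involved; the path is a placeholder):

val rdd = sc.textFile("/path/to/file.csv")
val header = rdd.first()             // the first line, i.e. the header
val data = rdd.filter(_ != header)   // keep everything that is not the header line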

Re: removing header from csv file

2016-04-26 Thread Praveen Devarao
Hi Ashutosh, Could you give more details as to what you are wanting to do and which feature of Spark you want to use? Yes, spark-csv is a connector for the Spark SQL module, hence it works with SQLContext only. Thanking You

Re: removing header from csv file

2016-04-26 Thread Takeshi Yamamuro
Hi, What do you mean by "with sqlcontext only"? Do you mean you'd like to load the csv data as an RDD (SparkContext) or something? // maropu On Wed, Apr 27, 2016 at 2:24 PM, Ashutosh Kumar wrote: > I see there is a library spark-csv which can be used for removing header > and

removing header from csv file

2016-04-26 Thread Ashutosh Kumar
I see there is a library, spark-csv, which can be used for removing the header and processing of csv files. But it seems it works with SQLContext only. Is there a way to remove the header from csv files without SQLContext? Thanks Ashutosh

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
Based on my example, how about renaming columns? val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), df2("b").as("2-b")) val df4 = df3.join(df2, df3("2-b") === df2("b")) // maropu

Re: Cant join same dataframe twice ?

2016-04-26 Thread Divya Gehlot
Correct, Takeshi. I am facing the same issue. How can I avoid the ambiguity? On 27 April 2016 at 11:54, Takeshi Yamamuro wrote: > Hi, > > I tried; > val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") > val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") > val df3 =

Re: Streaming K-means not printing predictions

2016-04-26 Thread Prashant Sharma
Since you are reading from a file stream, I would suggest that instead of printing you try to save it to a file. There may be output the first time and then no data in subsequent iterations. Prashant Sharma On Tue, Apr 26, 2016 at 7:40 PM, Ashutosh Kumar wrote: > I created a
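As a hedged illustration of "save instead of print", roughly following the streaming k-means example (paths, k and the number of dimensions are assumptions; ssc is an existing StreamingContext):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val trainingData = ssc.textFileStream("/path/to/training").map(Vectors.parse)
val testData = ssc.textFileStream("/path/to/test").map(LabeledPoint.parse)

val model = new StreamingKMeans().setK(3).setDecayFactor(1.0).setRandomCenters(2, 0.0)
model.trainOn(trainingData)
// write one output directory per batch instead of printing to the console
model.predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveAsTextFiles("/path/to/predictions")
ssc.start()
ssc.awaitTermination()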

Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Prashant Sharma
What Davies said is correct; the second argument is Hadoop's output format. Hadoop supports many types of output formats, and all of them have their own advantages. Apart from the one specified above, https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html
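For illustration, a small Scala sketch that passes an explicit Hadoop output format (the PySpark call is analogous, taking the format class name as a string; path and data are placeholders):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val pairs = sc.parallelize(Seq(("2016-04-26", "line one"), ("2016-04-27", "line two")))
// key class, value class and the Hadoop OutputFormat that controls how records land on disk
pairs.map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/output",
    classOf[Text], classOf[Text],
    classOf[TextOutputFormat[Text, Text]])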

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
Yeah, I think so. This is a kind of common mistake. // maropu On Wed, Apr 27, 2016 at 1:05 PM, Ted Yu wrote: > The ambiguity came from: > > scala> df3.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(a,IntegerType,false), >

Re: Cant join same dataframe twice ?

2016-04-26 Thread Ted Yu
The ambiguity came from: scala> df3.schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(b,IntegerType,false)) On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro wrote: > Hi, > > I

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
Hi, I tried; val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") val df3 = df1.join(df2, "a") val df4 = df3.join(df2, "b") And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#6, b#14.; If same case, this

Authorization Support(on all operations not only DDL) in Spark Sql

2016-04-26 Thread our...@cnsuning.com
hi rxin, Will Spark SQL support authorization on all operations, not only DDL? In my use case, a Hive table was granted read access to userA and other users don't have permission to read it, but userB can still read this Hive table using Spark SQL. Ricky Ou

Re: JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Mail.com
wholeTextFiles() works. It is just that it does not provide the parallelism. This is on Spark 1.4. HDP 2.3.2. Batch jobs. Thanks > On Apr 26, 2016, at 9:16 PM, Harjit Singh > wrote: > > You will have to write your customReceiver to do that. I don’t think >

Re: JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Harjit Singh
You will have to write your customReceiver to do that. I don’t think wholeTextFile is designed for that. - Harjit > On Apr 26, 2016, at 7:19 PM, Mail.com wrote: > > > Hi All, > I am reading entire directory of gz XML files with wholeTextFiles. > > I understand as it

Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Davies Liu
hdfs://192.168.10.130:9000/dev/output/test already exists, so you need to remove it first. On Tue, Apr 26, 2016 at 5:28 AM, Luke Adolph wrote: > Hi, all: > Below is my code: > > from pyspark import * > import re > > def getDateByLine(input_str): > str_pattern =
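If deleting the directory by hand each run is inconvenient, one option (a sketch, not part of the original thread) is to remove it through the Hadoop FileSystem API before saving; rdd below stands for whatever RDD is being persisted:

import org.apache.hadoop.fs.{FileSystem, Path}

val out = "hdfs://192.168.10.130:9000/dev/output/test"
val fs = FileSystem.get(new java.net.URI(out), sc.hadoopConfiguration)
fs.delete(new Path(out), true)   // recursive delete; returns false if the path did not exist
rdd.saveAsTextFile(out)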

Re: EOFException while reading from HDFS

2016-04-26 Thread Davies Liu
The Spark package you are using is packaged with Hadoop 2.6, but the HDFS is Hadoop 1.0.4, they are not compatible. On Tue, Apr 26, 2016 at 11:18 AM, Bibudh Lahiri wrote: > Hi, > I am trying to load a CSV file which is on HDFS. I have two machines: > IMPETUS-1466

JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Mail.com
Hi All, I am reading an entire directory of gz XML files with wholeTextFiles. I understand that, as they are gz and read with wholeTextFiles, the individual files are not splittable, but why is the entire directory read by one executor, in a single task? I have provided the number of executors as the number of files in that
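One thing worth trying (an assumption, not a confirmed fix): ask for more partitions up front and repartition right after reading, e.g. in Scala:

// minPartitions is only a hint for WholeTextFileInputFormat, so an explicit repartition
// is the more reliable way to spread the downstream work across executors
val files = sc.wholeTextFiles("hdfs:///path/to/gz-xml-dir", minPartitions = 100)
val spread = files.repartition(100)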

Re: Cant join same dataframe twice ?

2016-04-26 Thread Prasad Ravilla
Also, check the column names of df1 ( after joining df2 and df3 ). Prasad. From: Ted Yu Date: Monday, April 25, 2016 at 8:35 PM To: Divya Gehlot Cc: "user @spark" Subject: Re: Cant join same dataframe twice ? Can you show us the structure of df2 and df3 ? Thanks On Mon, Apr 25, 2016 at 8:23

Last RDD always being Run

2016-04-26 Thread Harjit Singh
I'm running the LogAnalyzerStreaming example. It's processing the files fine, but it keeps emitting the output of the last processed RDD until it gets a new one. Is there a way to prevent that? I'm planning to use this example in a real scenario where, when I have processed the data, I would be
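Without seeing the example's output code it is hard to be definite, but a common guard (sketch only; processedStream is a placeholder for the example's result DStream) is to check for an empty batch inside foreachRDD:

processedStream.foreachRDD { rdd =>
  // only act when the current batch actually produced data
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"/path/to/out/batch-${System.currentTimeMillis()}")
  }
}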

test

2016-04-26 Thread Harjit Singh

EOFException while reading from HDFS

2016-04-26 Thread Bibudh Lahiri
Hi, I am trying to load a CSV file which is on HDFS. I have two machines: IMPETUS-1466 (172.26.49.156) and IMPETUS-1325 (172.26.49.55). Both have Spark 1.6.0 pre-built for Hadoop 2.6 and later, but for both, I had existing Hadoop clusters running Hadoop 1.0.4. I have launched HDFS from

Re: Reading from Amazon S3

2016-04-26 Thread Ted Yu
Looking at the cause of the error, it seems hadoop-aws-xx.jar (corresponding to the version of Hadoop you use) was missing from the classpath. FYI On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj wrote: > Hi All, > I am trying to read a file stored in Amazon S3. > I wrote
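A hedged sketch of wiring that up, assuming hadoop-aws (and its aws-java-sdk dependency) has been put on the classpath, for example via spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1; keys and bucket are placeholders:

sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3a://your-bucket/path/to/file.txt")
println(lines.count())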

Re: Fill Gaps between rows

2016-04-26 Thread Andrés Ivaldi
Thanks Sebastian, I've never understood that part of HiveContext. So is it possible to use HiveContext, then use the window functions and save the DataFrame into another source like MSSQL, Oracle, or any with JDBC? Regards. On Tue, Apr 26, 2016 at 1:22 PM, Sebastian Piu
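A rough Scala sketch of that combination (table, columns, JDBC URL and credentials are all made up for illustration):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val events = Seq(("a", 1, 10), ("a", 4, 12), ("b", 2, 7)).toDF("key", "day", "value")
// window functions need HiveContext in 1.x, but no Hive installation or metastore is required
val w = Window.partitionBy("key").orderBy("day")
val withPrev = events.withColumn("prev_day", lag("day", 1).over(w))

// the result can then be written anywhere JDBC reaches (MSSQL, Oracle, ...)
val props = new java.util.Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")
withPrev.write.jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "gap_filled_events", props)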

Re: Fill Gaps between rows

2016-04-26 Thread Sebastian Piu
Yes, you need HiveContext for the window functions, but you don't need Hive itself for it to work. On Tue, 26 Apr 2016, 14:15 Andrés Ivaldi, wrote: > Hello, do exists an Out Of the box for fill in gaps between rows with a > given condition? > As example: I have a source table with

Reading from Amazon S3

2016-04-26 Thread Jinan Alhajjaj
Hi All, I am trying to read a file stored in Amazon S3. I wrote this code: import java.util.List; import java.util.Scanner; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: not a mapper but a task for HadoopRDD, maybe, as far as I know. I think the clearest way is just to run a job on multiple files with the API and check the number of tasks in the job. On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon" wrote: wholeTextFile() API uses

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
wholeTextFile() API uses WholeTextFileInputFormat, https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala, which returns false for isSplittable. In this case, only a single mapper appears for the entire
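A quick way to check that empirically in the shell (the path is a placeholder):

val rdd = sc.wholeTextFiles("s3a://your-bucket/some-prefix/")
// each partition becomes at least one task; for a single non-splittable file expect 1 here
println(rdd.partitions.size)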

Re: Spark SQL query for List

2016-04-26 Thread Ramkumar V
I'm getting the following exception if I form a query like this. It's not getting to the point where get(0) or get(1) is called. Exception in thread "main" java.lang.RuntimeException: [1.22] failure: ``*'' expected but `cities' found *Thanks*, On Tue, Apr 26, 2016 at
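Without the original query it is hard to say exactly what the parser objected to, but here is a Scala sketch of two common ways to pull the nested array out (untested against the original Parquet data):

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

val toursDF = sqlContext.table("parquetFile")
// one Row per record, with CITIES materialised as a Seq[String]
val citiesDF = toursDF.select($"TOUR.CITIES".as("cities"))
val firstCities: Seq[String] = citiesDF.first().getAs[Seq[String]]("cities")

// or flatten to one city per row
val perCity = toursDF.select(explode($"TOUR.CITIES").as("city"))
perCity.show()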

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3 , etc. Spark supports text files, SequenceFiles

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And also https://spark.apache.org/docs/1.6.0/programming-guide.html If the file is single file, then this would not be distributed. On 26 Apr 2016 11:52 p.m., "Ted Yu" wrote: > Please take a look at: > core/src/main/scala/org/apache/spark/SparkContext.scala > >* Do `val

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Ted Yu
Please take a look at core/src/main/scala/org/apache/spark/SparkContext.scala: Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`, then `rdd` contains (a-hdfs-path/part-0, its content), (a-hdfs-path/part-1, its content), ...

Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu
Hi guys, I'm trying to read many files from S3 using JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed manner? Please give me a link to the place in the documentation where it's specified. Thanks, Vadim.

Re: Kafka exception in Apache Spark

2016-04-26 Thread Cody Koeninger
That error indicates a message bigger than the buffer's capacity https://issues.apache.org/jira/browse/KAFKA-1196 On Tue, Apr 26, 2016 at 3:07 AM, Michel Hubert wrote: > Hi, > > > > > > I use a Kafka direct stream approach. > > My Spark application was running ok. > > This
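If an oversized message is indeed the cause, a hedged fix is to raise the consumer fetch size in the kafkaParams passed to createDirectStream (brokers, topic and the 50 MB limit below are arbitrary examples; the limit must exceed the largest message):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "fetch.message.max.bytes" -> (50 * 1024 * 1024).toString
)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("newtopic"))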

Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Doesn't get(0) give you the Array[String] for CITY (am I missing something?) On 26 Apr 2016 11:02 p.m., "Ramkumar V" wrote: JavaSparkContext ctx = new JavaSparkContext(sparkConf); SQLContext sqlContext = new SQLContext(ctx); DataFrame parquetFile =

Streaming K-means not printing predictions

2016-04-26 Thread Ashutosh Kumar
I created a streaming k-means based on the Scala example. It keeps running without any error but never prints predictions. Here is the log: 19:15:05,050 INFO org.apache.spark.streaming.scheduler.InputInfoTracker - remove old batch metadata: 146167824 ms 19:15:10,001 INFO

Re: Spark SQL query for List

2016-04-26 Thread Ramkumar V
JavaSparkContext ctx = new JavaSparkContext(sparkConf); SQLContext sqlContext = new SQLContext(ctx); DataFrame parquetFile = sqlContext.parquetFile( "hdfs:/XYZ:8020/user/hdfs/parquet/*.parquet"); parquetFile.registerTempTable("parquetFile"); DataFrame tempDF =

Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Could you maybe share your codes? On 26 Apr 2016 9:51 p.m., "Ramkumar V" wrote: > Hi, > > I had loaded JSON file in parquet format into SparkSQL. I can't able to > read List which is inside JSON. > > Sample JSON > > { > "TOUR" : { > "CITIES" :

Fill Gaps between rows

2016-04-26 Thread Andrés Ivaldi
Hello, does an out-of-the-box way exist to fill in gaps between rows with a given condition? As an example: I have a source table with data and a column with the day number, but a record only registers an event and not necessarily all days have events, so the table does not necessarily have all days. But I want a

how to make task is assigned after all executors are launched

2016-04-26 Thread Qian Huang
Hi all, Since the data I want to process is not on HDFS, I try to use sc.makeRDD() to ensure all items of a partition are located on one node, so the task can be launched on that node. Now comes the problem: sometimes the task is already assigned to some executors, then other executors are
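If the goal is simply to delay scheduling until most executors have registered, the usual knobs are the two settings below (the values are examples, not recommendations):

val conf = new org.apache.spark.SparkConf()
  .setAppName("wait-for-executors")
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.9")       // wait for 90% of executors
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s") // but at most 60 seconds
val sc = new org.apache.spark.SparkContext(conf)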

Spark SQL query for List

2016-04-26 Thread Ramkumar V
Hi, I have loaded a JSON file in Parquet format into Spark SQL. I'm not able to read a List which is inside the JSON. Sample JSON { "TOUR" : { "CITIES" : ["Paris","Berlin","Prague"] }, "BUDJET" : 100 } I want to read the value of CITIES. *Thanks*,

Save RDD to HDFS using Spark Python API

2016-04-26 Thread Luke Adolph
Hi, all: Below is my code: from pyspark import * import re def getDateByLine(input_str): str_pattern = '^\d{4}-\d{2}-\d{2}' pattern = re.compile(str_pattern) match = pattern.match(input_str) if match: return match.group() else: return None file_url =

Re: Splitting spark dstream into separate fields

2016-04-26 Thread Mich Talebzadeh
> > > > Thanks Praveen. > > With regard to key/value pair. My kafka takes the following rows as input > > cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list > rhes564:9092 --topic newtopic > > That ${IN_FILE} is th

Re: Splitting spark dstream into separate fields

2016-04-26 Thread Praveen Devarao
E} is the source of prices (1000 as follows ID TIMESTAMP PRICE 40, 20160426-080924, 67.55738301621814598514 So tuples would be like below? (1,"ID") (2, "TIMESTAMP") (3, "PRICE") For values val words1 = lines.map(_.split(',').view(1)) val w

Unsubscribe

2016-04-26 Thread Andrew Heinrichs
Unsubscribe On Apr 22, 2016 3:21 PM, "Mich Talebzadeh" wrote: > > Hi, > > Anyone know which jar file has import org.apache.spark.internal.Logging? > > I tried *spark-core_2.10-1.5.1.jar * > > but does not seem to work > > scala> import

Re: Splitting spark dstream into separate fields

2016-04-26 Thread Mich Talebzadeh
PRICE 40, 20160426-080924, 67.55738301621814598514 So tuples would be like below? (1,"ID") (2, "TIMESTAMP") (3, "PRICE") For values val words1 = lines.map(_.split(',').view(1)) val words2 = lines.map(_.split(',').view(2)) val words3 = lines

RE: Kafka exception in Apache Spark

2016-04-26 Thread Michel Hubert
This is production. From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Tuesday, 26 April 2016 12:01 To: Michel Hubert Cc: user@spark.apache.org Subject: Re: Kafka exception in Apache Spark Hi Michael, Is this production or test? Dr Mich Talebzadeh

Re: Splitting spark dstream into separate fields

2016-04-26 Thread Praveen Devarao
Hi Mich, >> val lines = dstream.map(_._2) This maps the record into components? Is that the correct understanding of it << Not sure what you are referring to when you said "record into components". The above function is basically giving you the tuple

Re: Kafka exception in Apache Spark

2016-04-26 Thread Mich Talebzadeh
Hi Michael, Is this production or test? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 26 April

Kafka exception in Apache Spark

2016-04-26 Thread Michel Hubert
Hi, I use a Kafka direct stream approach. My Spark application was running ok. This morning we upgraded to CDH 5.7.0 And when I re-started my Spark application I get exceptions. It seems a problem with the direct stream approach. Any ideas how to fix this? User class threw exception:

Splitting spark dstream into separate fields

2016-04-26 Thread Mich Talebzadeh
Hi, Is there any optimum way of splitting a dstream into components? I am doing Spark streaming and this is the dstream I get: val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) Now that dstream consists of 10,00 price lines per
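A compact sketch of splitting each line once rather than once per field, assuming the column layout ID, TIMESTAMP, PRICE quoted later in the thread:

val lines = dstream.map(_._2)                    // the Kafka message value
val parsed = lines.map(_.split(',')).map { a =>
  (a(0).trim, a(1).trim, a(2).trim.toDouble)     // (ID, TIMESTAMP, PRICE)
}
parsed.print()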

How can I bucketize / group a DataFrame from parquet files?

2016-04-26 Thread Brandon White
I am creating a DataFrame from parquet files. The schema is based on the parquet files; I do not know it beforehand. What I want to do is group the entire DF into buckets based on a column. val df = sqlContext.read.parquet("/path/to/files") val groupedBuckets: DataFrame[String, Array[Rows]] =
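A hedged Scala sketch of two ways to get per-key buckets (column names are assumptions; on Spark 1.6 collect_list needs a HiveContext-backed sqlContext):

import org.apache.spark.sql.functions.collect_list

val df = sqlContext.read.parquet("/path/to/files")
// option 1: stay in the DataFrame API and collect the values per bucket
val buckets = df.groupBy("bucket_column").agg(collect_list("some_value").as("values"))

// option 2: drop to the RDD API and group whole rows per key
val groupedRows = df.rdd.groupBy(row => row.getAs[String]("bucket_column"))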