Re: removing header from csv file

2016-04-26 Thread nihed mbarek
You can add a filter with a string that you are sure appears only in the
header.
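
For example, a rough Scala sketch (the "id," prefix is just an assumed header
token; use whatever appears only in your header row):

// Drop the header by filtering out the line that carries a token known to
// appear only in the header row ("id," is an assumption here).
val lines = sc.textFile("hdfs:///path/to/file.csv")        // example path
val data  = lines.filter(line => !line.startsWith("id,"))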

On Wednesday, 27 April 2016, Divya Gehlot wrote:

> yes you can remove the headers by removing the first row
>
> can first() or head() to do that
>
>
> Thanks,
> Divya
>
> On 27 April 2016 at 13:24, Ashutosh Kumar  > wrote:
>
>> I see there is a library spark-csv which can be used for removing header
>> and processing of csv files. But it seems it works with sqlcontext only. Is
>> there a way to remove header from csv files without sqlcontext ?
>>
>> Thanks
>> Ashutosh
>>
>
>

-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: removing header from csv file

2016-04-26 Thread Divya Gehlot
Yes, you can remove the header by removing the first row.

You can use first() or head() to do that.
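
For example (a minimal Scala sketch; the path is a placeholder):

val rdd    = sc.textFile("hdfs:///path/to/file.csv")   // example path
val header = rdd.first()                               // first line = header
val data   = rdd.filter(line => line != header)        // drop the header line

Note that first() triggers a small job, and the filter also drops any data line
that happens to equal the header; mapPartitionsWithIndex can be used instead if
that matters.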


Thanks,
Divya

On 27 April 2016 at 13:24, Ashutosh Kumar  wrote:

> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it seems it works with sqlcontext only. Is
> there a way to remove header from csv files without sqlcontext ?
>
> Thanks
> Ashutosh
>


Re: removing header from csv file

2016-04-26 Thread Praveen Devarao
Hi Ashutosh,

Could you give more details as to what you are wanting to do and which 
feature of Spark you want to use? Yes, spark-csv is a connector for the 
SparkSQL module, hence it works with SQLContext only.

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Ashutosh Kumar 
To: "user @spark" 
Date:   27/04/2016 10:55 am
Subject:removing header from csv file



I see there is a library spark-csv which can be used for removing header 
and processing of csv files. But it seems it works with sqlcontext only. 
Is there a way to remove header from csv files without sqlcontext ? 

Thanks
Ashutosh





Re: removing header from csv file

2016-04-26 Thread Takeshi Yamamuro
Hi,

What do you mean by "with sqlcontext only"?
Do you mean you'd like to load the CSV data as an RDD (via SparkContext) or something?

// maropu

On Wed, Apr 27, 2016 at 2:24 PM, Ashutosh Kumar 
wrote:

> I see there is a library spark-csv which can be used for removing header
> and processing of csv files. But it seems it works with sqlcontext only. Is
> there a way to remove header from csv files without sqlcontext ?
>
> Thanks
> Ashutosh
>



-- 
---
Takeshi Yamamuro


removing header from csv file

2016-04-26 Thread Ashutosh Kumar
I see there is a library spark-csv which can be used for removing header
and processing of csv files. But it seems it works with sqlcontext only. Is
there a way to remove header from csv files without sqlcontext ?

Thanks
Ashutosh


Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
Based on my example, how about renaming columns?

val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"),
df2("b").as("2-b"))
val df4 = df3.join(df2, df3("2-b") === df2("b"))
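
Another option that should work on 1.5/1.6 is to alias the DataFrames and
qualify the columns (untested sketch):

// Aliases let you refer to each side's "b" unambiguously.
val l = df1.as("l")
val r = df2.as("r")
val joined = l.join(r, $"l.a" === $"r.a")
  .select($"l.a", $"l.b".as("l_b"), $"r.b".as("r_b"))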

// maropu

On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot 
wrote:

> Correct Takeshi
> Even I am facing the same issue .
>
> How to avoid the ambiguity ?
>
>
> On 27 April 2016 at 11:54, Takeshi Yamamuro  wrote:
>
>> Hi,
>>
>> I tried;
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df3 = df1.join(df2, "a")
>> val df4 = df3.join(df2, "b")
>>
>> And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
>> ambiguous, could be: b#6, b#14.;
>> If same case, this message makes sense and this is clear.
>>
>> Thought?
>>
>> // maropu
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla 
>> wrote:
>>
>>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>>
>>> Prasad.
>>>
>>> From: Ted Yu
>>> Date: Monday, April 25, 2016 at 8:35 PM
>>> To: Divya Gehlot
>>> Cc: "user @spark"
>>> Subject: Re: Cant join same dataframe twice ?
>>>
>>> Can you show us the structure of df2 and df3 ?
>>>
>>> Thanks
>>>
>>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
>>> wrote:
>>>
 Hi,
 I am using Spark 1.5.2 .
 I have a use case where I need to join the same dataframe twice on two
 different columns.
 I am getting error missing Columns

 For instance ,
 val df1 = df2.join(df3,"Column1")
 Below throwing error missing columns
 val df 4 = df1.join(df3,"Column2")

 Is the bug or valid scenario ?




 Thanks,
 Divya

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Cant join same dataframe twice ?

2016-04-26 Thread Divya Gehlot
Correct, Takeshi.
Even I am facing the same issue.

How to avoid the ambiguity?


On 27 April 2016 at 11:54, Takeshi Yamamuro  wrote:

> Hi,
>
> I tried;
> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df3 = df1.join(df2, "a")
> val df4 = df3.join(df2, "b")
>
> And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
> ambiguous, could be: b#6, b#14.;
> If same case, this message makes sense and this is clear.
>
> Thought?
>
> // maropu
>
>
>
>
>
>
>
> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla 
> wrote:
>
>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>
>> Prasad.
>>
>> From: Ted Yu
>> Date: Monday, April 25, 2016 at 8:35 PM
>> To: Divya Gehlot
>> Cc: "user @spark"
>> Subject: Re: Cant join same dataframe twice ?
>>
>> Can you show us the structure of df2 and df3 ?
>>
>> Thanks
>>
>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
>> wrote:
>>
>>> Hi,
>>> I am using Spark 1.5.2 .
>>> I have a use case where I need to join the same dataframe twice on two
>>> different columns.
>>> I am getting error missing Columns
>>>
>>> For instance ,
>>> val df1 = df2.join(df3,"Column1")
>>> Below throwing error missing columns
>>> val df 4 = df1.join(df3,"Column2")
>>>
>>> Is the bug or valid scenario ?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Divya
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Streaming K-means not printing predictions

2016-04-26 Thread Prashant Sharma
Since you are reading from a file stream, I would suggest saving the output to
a file instead of printing it. There may be output the first time and then no
data in subsequent iterations.
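
For example, something like this instead of print() (a sketch based on the MLlib
streaming k-means example; the output path is a placeholder):

// `model` is the StreamingKMeans instance and `testData` a DStream[LabeledPoint],
// as in the example; each batch interval writes its own output directory.
model.predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveAsTextFiles("hdfs:///tmp/kmeans-predictions")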

Prashant Sharma



On Tue, Apr 26, 2016 at 7:40 PM, Ashutosh Kumar 
wrote:

> I created a Streaming k means based on scala example. It keeps running
> without any error but never prints predictions
>
> Here is Log
>
> 19:15:05,050 INFO
> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
> batch metadata: 146167824 ms
> 19:15:10,001 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
> files took 1 ms
> 19:15:10,001 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - New files
> at time 146167831 ms:
>
> 19:15:10,007 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
> files took 2 ms
> 19:15:10,007 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - New files
> at time 146167831 ms:
>
> 19:15:10,014 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Added jobs
> for time 146167831 ms
> 19:15:10,015 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Starting
> job streaming job 146167831 ms.0 from job set of time 146167831 ms
> 19:15:10,028 INFO
> org.apache.spark.SparkContext - Starting
> job: collect at StreamingKMeans.scala:89
> 19:15:10,028 INFO
> org.apache.spark.scheduler.DAGScheduler   - Job 292
> finished: collect at StreamingKMeans.scala:89, took 0.41 s
> 19:15:10,029 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Finished
> job streaming job 146167831 ms.0 from job set of time 146167831 ms
> 19:15:10,029 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Starting
> job streaming job 146167831 ms.1 from job set of time 146167831 ms
> ---
> Time: 146167831 ms
> ---
>
> 19:15:10,036 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Finished
> job streaming job 146167831 ms.1 from job set of time 146167831 ms
> 19:15:10,036 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2912 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2911 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2912
> 19:15:10,037 INFO
> org.apache.spark.streaming.scheduler.JobScheduler - Total
> delay: 0.036 s for time 146167831 ms (execution: 0.021 s)
> 19:15:10,037 INFO
> org.apache.spark.rdd.UnionRDD - Removing
> RDD 2800 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2911
> 19:15:10,037 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
> old files that were older than 146167825 ms: 1461678245000 ms
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2917 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2800
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2916 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2915 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.MapPartitionsRDD - Removing
> RDD 2914 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.rdd.UnionRDD - Removing
> RDD 2803 from persistence list
> 19:15:10,037 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
> old files that were older than 146167825 ms: 1461678245000 ms
> 19:15:10,038 INFO
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker - Deleting
> batches ArrayBuffer()
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2917
> 19:15:10,038 INFO
> org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
> batch metadata: 1461678245000 ms
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2914
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2916
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2915
> 19:15:10,038 INFO
> org.apache.spark.storage.BlockManager - Removing
> RDD 2803
> 19:15:15,001 INFO
> org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
> files took 1 ms
> 

Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Prashant Sharma
What Davies said is correct: the second argument is Hadoop's output format.
Hadoop supports many types of output formats, and all of them have their own
advantages. Apart from the one specified above,
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html
is one such output format class.


thanks,

Prashant Sharma



On Wed, Apr 27, 2016 at 5:22 AM, Davies Liu  wrote:

> hdfs://192.168.10.130:9000/dev/output/test already exists, so you need
> to remove it first.
>
> On Tue, Apr 26, 2016 at 5:28 AM, Luke Adolph  wrote:
> > Hi, all:
> > Below is my code:
> >
> > from pyspark import *
> > import re
> >
> > def getDateByLine(input_str):
> > str_pattern = '^\d{4}-\d{2}-\d{2}'
> > pattern = re.compile(str_pattern)
> > match = pattern.match(input_str)
> > if match:
> > return match.group()
> > else:
> > return None
> >
> > file_url = "hdfs://192.168.10.130:9000/dev/test/test.log"
> > input_file = sc.textFile(file_url)
> > line = input_file.filter(getDateByLine).map(lambda x: (x[:10], 1))
> > counts = line.reduceByKey(lambda a,b: a+b)
> > print counts.collect()
> > counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",\
> >
>  "org.apache.hadoop.mapred.SequenceFileOutputFormat")
> >
> >
> > What I confused is the method saveAsHadoopFile,I have read the pyspark
> API,
> > But I still don’t understand the second arg mean
> >
> > Below is the output when I run above code:
> > ```
> >
> > [(u'2016-02-29', 99), (u'2016-03-02', 30)]
> >
> >
> ---
> > Py4JJavaError Traceback (most recent call
> last)
> >  in ()
> >  18 counts = line.reduceByKey(lambda a,b: a+b)
> >  19 print counts.collect()
> > ---> 20
> > counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",
> > "org.apache.hadoop.mapred.SequenceFileOutputFormat")
> >
> > /mydata/softwares/spark-1.6.1/python/pyspark/rdd.pyc in
> > saveAsHadoopFile(self, path, outputFormatClass, keyClass, valueClass,
> > keyConverter, valueConverter, conf, compressionCodecClass)
> >1419  keyClass,
> > valueClass,
> >1420  keyConverter,
> > valueConverter,
> > -> 1421  jconf,
> > compressionCodecClass)
> >1422
> >1423 def saveAsSequenceFile(self, path,
> compressionCodecClass=None):
> >
> >
> /mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
> > in __call__(self, *args)
> > 811 answer = self.gateway_client.send_command(command)
> > 812 return_value = get_return_value(
> > --> 813 answer, self.gateway_client, self.target_id,
> self.name)
> > 814
> > 815 for temp_arg in temp_args:
> >
> > /mydata/softwares/spark-1.6.1/python/pyspark/sql/utils.pyc in deco(*a,
> **kw)
> >  43 def deco(*a, **kw):
> >  44 try:
> > ---> 45 return f(*a, **kw)
> >  46 except py4j.protocol.Py4JJavaError as e:
> >  47 s = e.java_exception.toString()
> >
> >
> /mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
> > in get_return_value(answer, gateway_client, target_id, name)
> > 306 raise Py4JJavaError(
> > 307 "An error occurred while calling
> {0}{1}{2}.\n".
> > --> 308 format(target_id, ".", name), value)
> > 309 else:
> > 310 raise Py4JError(
> >
> > Py4JJavaError: An error occurred while calling
> > z:org.apache.spark.api.python.PythonRDD.saveAsHadoopFile.
> > : org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> > hdfs://192.168.10.130:9000/dev/output/test already exists
> >   at
> >
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1179)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
> >   at
> >
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> >   at
> >
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> >   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156)
> >   at
> >
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060)
> >   at
> >
> 

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
Yeah, I think so. This is a common kind of mistake.

// maropu

On Wed, Apr 27, 2016 at 1:05 PM, Ted Yu  wrote:

> The ambiguity came from:
>
> scala> df3.schema
> res0: org.apache.spark.sql.types.StructType =
> StructType(StructField(a,IntegerType,false),
> StructField(b,IntegerType,false), StructField(b,IntegerType,false))
>
> On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> I tried;
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df3 = df1.join(df2, "a")
>> val df4 = df3.join(df2, "b")
>>
>> And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
>> ambiguous, could be: b#6, b#14.;
>> If same case, this message makes sense and this is clear.
>>
>> Thought?
>>
>> // maropu
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla 
>> wrote:
>>
>>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>>
>>> Prasad.
>>>
>>> From: Ted Yu
>>> Date: Monday, April 25, 2016 at 8:35 PM
>>> To: Divya Gehlot
>>> Cc: "user @spark"
>>> Subject: Re: Cant join same dataframe twice ?
>>>
>>> Can you show us the structure of df2 and df3 ?
>>>
>>> Thanks
>>>
>>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
>>> wrote:
>>>
 Hi,
 I am using Spark 1.5.2 .
 I have a use case where I need to join the same dataframe twice on two
 different columns.
 I am getting error missing Columns

 For instance ,
 val df1 = df2.join(df3,"Column1")
 Below throwing error missing columns
 val df 4 = df1.join(df3,"Column2")

 Is the bug or valid scenario ?




 Thanks,
 Divya

>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: Cant join same dataframe twice ?

2016-04-26 Thread Ted Yu
The ambiguity came from:

scala> df3.schema
res0: org.apache.spark.sql.types.StructType =
StructType(StructField(a,IntegerType,false),
StructField(b,IntegerType,false), StructField(b,IntegerType,false))

On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> I tried;
> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df3 = df1.join(df2, "a")
> val df4 = df3.join(df2, "b")
>
> And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
> ambiguous, could be: b#6, b#14.;
> If same case, this message makes sense and this is clear.
>
> Thought?
>
> // maropu
>
>
>
>
>
>
>
> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla 
> wrote:
>
>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>
>> Prasad.
>>
>> From: Ted Yu
>> Date: Monday, April 25, 2016 at 8:35 PM
>> To: Divya Gehlot
>> Cc: "user @spark"
>> Subject: Re: Cant join same dataframe twice ?
>>
>> Can you show us the structure of df2 and df3 ?
>>
>> Thanks
>>
>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
>> wrote:
>>
>>> Hi,
>>> I am using Spark 1.5.2 .
>>> I have a use case where I need to join the same dataframe twice on two
>>> different columns.
>>> I am getting error missing Columns
>>>
>>> For instance ,
>>> val df1 = df2.join(df3,"Column1")
>>> Below throwing error missing columns
>>> val df 4 = df1.join(df3,"Column2")
>>>
>>> Is the bug or valid scenario ?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Divya
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
Hi,

I tried;
val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df3 = df1.join(df2, "a")
val df4 = df3.join(df2, "b")

And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is
ambiguous, could be: b#6, b#14.;
If it is the same case, this message makes sense and the cause is clear.

Thoughts?

// maropu







On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla  wrote:

> Also, check the column names of df1 ( after joining df2 and df3 ).
>
> Prasad.
>
> From: Ted Yu
> Date: Monday, April 25, 2016 at 8:35 PM
> To: Divya Gehlot
> Cc: "user @spark"
> Subject: Re: Cant join same dataframe twice ?
>
> Can you show us the structure of df2 and df3 ?
>
> Thanks
>
> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I am using Spark 1.5.2 .
>> I have a use case where I need to join the same dataframe twice on two
>> different columns.
>> I am getting error missing Columns
>>
>> For instance ,
>> val df1 = df2.join(df3,"Column1")
>> Below throwing error missing columns
>> val df 4 = df1.join(df3,"Column2")
>>
>> Is the bug or valid scenario ?
>>
>>
>>
>>
>> Thanks,
>> Divya
>>
>
>


-- 
---
Takeshi Yamamuro


Authorization Support(on all operations not only DDL) in Spark Sql

2016-04-26 Thread our...@cnsuning.com
Hi rxin,
Will Spark SQL support authorization on all operations, not only DDL?
In my use case, a Hive table was granted read access to userA and other 
users don't have permission to read it, but userB can read this Hive table using 
Spark SQL.

 











Ricky  Ou







Re: JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Mail.com
wholeTextFiles() works.  It is just that it does not provide the parallelism.

This is on Spark 1.4. HDP 2.3.2. Batch jobs.

Thanks

> On Apr 26, 2016, at 9:16 PM, Harjit Singh  
> wrote:
> 
> You will have to write your customReceiver to do that. I don’t think 
> wholeTextFile is designed for that.
> 
> - Harjit
>> On Apr 26, 2016, at 7:19 PM, Mail.com  wrote:
>> 
>> 
>> Hi All,
>> I am reading entire directory of gz XML files with wholeTextFiles. 
>> 
>> I understand as it is gz and with wholeTextFiles the individual files are 
>> not splittable but why the entire directory is read by one executor, single 
>> task. I have provided number of executors as number of files in that 
>> directory.
>> 
>> Is the only option here is to repartition after the xmls are read and parsed 
>> with JaxB.
>> 
>> Regards,
>> Pradeep
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> v/r,
> Harjit Singh
> Decipher Technology Studios
> email:harjit.sin...@deciphernow.com
> mobile: 303-870-0883
> website: deciphertechstudios.com 
> 
> GPG:
> keyserver: hkps://hkps.pool.sks-keyservers.net
> keyid: D814A2EF
> 
> 
> 
> 
> 


Re: JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Harjit Singh
You will have to write your own custom receiver to do that. I don't think 
wholeTextFiles is designed for that.

- Harjit
> On Apr 26, 2016, at 7:19 PM, Mail.com  wrote:
> 
> 
> Hi All,
> I am reading entire directory of gz XML files with wholeTextFiles.
> 
> I understand as it is gz and with wholeTextFiles the individual files are not 
> splittable but why the entire directory is read by one executor, single task. 
> I have provided number of executors as number of files in that directory.
> 
> Is the only option here is to repartition after the xmls are read and parsed 
> with JaxB.
> 
> Regards,
> Pradeep
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

v/r,
Harjit Singh
Decipher Technology Studios
email:harjit.sin...@deciphernow.com 

mobile: 303-870-0883
website: deciphertechstudios.com

GPG:
keyserver: hkps://hkps.pool.sks-keyservers.net 

keyid: D814A2EF









Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Davies Liu
hdfs://192.168.10.130:9000/dev/output/test already exists, so you need
to remove it first.

On Tue, Apr 26, 2016 at 5:28 AM, Luke Adolph  wrote:
> Hi, all:
> Below is my code:
>
> from pyspark import *
> import re
>
> def getDateByLine(input_str):
> str_pattern = '^\d{4}-\d{2}-\d{2}'
> pattern = re.compile(str_pattern)
> match = pattern.match(input_str)
> if match:
> return match.group()
> else:
> return None
>
> file_url = "hdfs://192.168.10.130:9000/dev/test/test.log"
> input_file = sc.textFile(file_url)
> line = input_file.filter(getDateByLine).map(lambda x: (x[:10], 1))
> counts = line.reduceByKey(lambda a,b: a+b)
> print counts.collect()
> counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",\
> "org.apache.hadoop.mapred.SequenceFileOutputFormat")
>
>
> What I confused is the method saveAsHadoopFile,I have read the pyspark API,
> But I still don’t understand the second arg mean
>
> Below is the output when I run above code:
> ```
>
> [(u'2016-02-29', 99), (u'2016-03-02', 30)]
>
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>  18 counts = line.reduceByKey(lambda a,b: a+b)
>  19 print counts.collect()
> ---> 20
> counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",
> "org.apache.hadoop.mapred.SequenceFileOutputFormat")
>
> /mydata/softwares/spark-1.6.1/python/pyspark/rdd.pyc in
> saveAsHadoopFile(self, path, outputFormatClass, keyClass, valueClass,
> keyConverter, valueConverter, conf, compressionCodecClass)
>1419  keyClass,
> valueClass,
>1420  keyConverter,
> valueConverter,
> -> 1421  jconf,
> compressionCodecClass)
>1422
>1423 def saveAsSequenceFile(self, path, compressionCodecClass=None):
>
> /mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
> in __call__(self, *args)
> 811 answer = self.gateway_client.send_command(command)
> 812 return_value = get_return_value(
> --> 813 answer, self.gateway_client, self.target_id, self.name)
> 814
> 815 for temp_arg in temp_args:
>
> /mydata/softwares/spark-1.6.1/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
>
> /mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
> 306 raise Py4JJavaError(
> 307 "An error occurred while calling {0}{1}{2}.\n".
> --> 308 format(target_id, ".", name), value)
> 309 else:
> 310 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.saveAsHadoopFile.
> : org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> hdfs://192.168.10.130:9000/dev/output/test already exists
>   at
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1179)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
>   at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026)
>   at
> org.apache.spark.api.python.PythonRDD$.saveAsHadoopFile(PythonRDD.scala:753)
>   at 
> 

Re: EOFException while reading from HDFS

2016-04-26 Thread Davies Liu
The Spark package you are using is packaged with Hadoop 2.6, but the
HDFS cluster is Hadoop 1.0.4; they are not compatible.

On Tue, Apr 26, 2016 at 11:18 AM, Bibudh Lahiri  wrote:
> Hi,
>   I am trying to load a CSV file which is on HDFS. I have two machines:
> IMPETUS-1466 (172.26.49.156) and IMPETUS-1325 (172.26.49.55). Both have
> Spark 1.6.0 pre-built for Hadoop 2.6 and later, but for both, I had existing
> Hadoop clusters running Hadoop 1.0.4. I have launched HDFS from
> 172.26.49.156 by running start-dfs.sh from it, copied files from local file
> system to HDFS and can view them by hadoop fs -ls.
>
>   However, when I am trying to load the CSV file from pyspark shell
> (launched by bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0)
> from IMPETUS-1325 (172.26.49.55) with the following commands:
>
>
>>>from pyspark.sql import SQLContext
>
>>>sqlContext = SQLContext(sc)
>
>>>patients_df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "false").load("hdfs://172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv")
>
>
> I get the following error:
>
>
> java.io.EOFException: End of File Exception between local host is:
> "IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55"; destination host is:
> "IMPETUS-1466":54310; : java.io.EOFException; For more details see:
> http://wiki.apache.org/hadoop/EOFException
>
>
> U have changed the port number from 54310 to 8020, but then I get the error
>
>
> java.net.ConnectException: Call From IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55
> to IMPETUS-1466:8020 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>
>
> To me it seemed like this may result from a version mismatch between Spark
> Hadoop client and my Hadoop cluster, so I have made the following changes:
>
>
> 1) Added the following lines to conf/spark-env.sh
>
>
> export HADOOP_HOME="/usr/local/hadoop-1.0.4" export
> HADOOP_CONF_DIR="$HADOOP_HOME/conf" export
> HDFS_URL="hdfs://172.26.49.156:8020"
>
>
> 2) Downloaded Spark 1.6.0, pre-built with user-provided Hadoop, and in
> addition to the three lines above, added the following line to
> conf/spark-env.sh
>
>
> export SPARK_DIST_CLASSPATH="/usr/local/hadoop-1.0.4/bin/hadoop"
>
>
> but none of it seems to work. However, the following command works from
> 172.26.49.55 and gives the directory listing:
>
> /usr/local/hadoop-1.0.4/bin/hadoop fs -ls hdfs://172.26.49.156:54310/
>
>
> Any suggestion?
>
>
> Thanks
>
> Bibudh
>
>
> --
> Bibudh Lahiri
> Data Scientist, Impetus Technolgoies
> 5300 Stevens Creek Blvd
> San Jose, CA 95129
> http://knowthynumbers.blogspot.com/
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



JavaSparkContext.wholeTextFiles read directory

2016-04-26 Thread Mail.com

Hi All,
I am reading an entire directory of gz XML files with wholeTextFiles.

I understand that because they are gz and read with wholeTextFiles the individual 
files are not splittable, but why is the entire directory read by one executor in a 
single task? I have provided the number of executors as the number of files in that directory.

Is the only option here to repartition after the XMLs are read and parsed 
with JAXB?
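
Something like this is what I mean by repartitioning (a Scala sketch for brevity;
parseXml() stands in for the JAXB step and the partition count is a placeholder):

val files  = sc.wholeTextFiles("hdfs:///data/xml-gz/", minPartitions = 16)   // path is an example
val parsed = files.map { case (path, content) => parseXml(content) }         // parseXml() is hypothetical
  .repartition(16)                                                           // spread downstream work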

Regards,
Pradeep
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cant join same dataframe twice ?

2016-04-26 Thread Prasad Ravilla
Also, check the column names of df1 ( after joining df2 and df3 ).

Prasad.

From: Ted Yu
Date: Monday, April 25, 2016 at 8:35 PM
To: Divya Gehlot
Cc: "user @spark"
Subject: Re: Cant join same dataframe twice ?

Can you show us the structure of df2 and df3 ?

Thanks

On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
> wrote:
Hi,
I am using Spark 1.5.2 .
I have a use case where I need to join the same dataframe twice on two 
different columns.
I am getting error missing Columns

For instance ,
val df1 = df2.join(df3,"Column1")
Below throwing error missing columns
val df 4 = df1.join(df3,"Column2")

Is the bug or valid scenario ?




Thanks,
Divya



Last RDD always being Run

2016-04-26 Thread Harjit Singh
I'm running the LogAnalyzerStreaming example. It processes the files fine, but it 
keeps emitting the output of the last processed RDD until it gets a new one. Is 
there a way to prevent that? I'm planning to use this example in a real scenario 
where, once I have processed the data, I would push it to a database. So if the 
RDD keeps emitting the same values, I would be persisting the same values again 
and again until I get a new one. Any ideas how to do this? I have tried using 
unpersist() on the RDD, but it doesn't help.
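
Concretely, in foreachRDD I would like to do something like this rough Scala
sketch (saveToDatabase() is a placeholder for my real sink) and skip batches that
carry nothing new:

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {             // only touch the DB when the batch has data
    saveToDatabase(rdd.collect())   // saveToDatabase() is hypothetical
  }
}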

Thanks

Harjit








test

2016-04-26 Thread Harjit Singh










EOFException while reading from HDFS

2016-04-26 Thread Bibudh Lahiri
Hi,
  I am trying to load a CSV file which is on HDFS. I have two
machines: IMPETUS-1466 (172.26.49.156) and IMPETUS-1325 (172.26.49.55).
Both have Spark 1.6.0 pre-built for Hadoop 2.6 and later, but for both, I
had existing Hadoop clusters running Hadoop 1.0.4. I have launched HDFS
from 172.26.49.156 by running start-dfs.sh from it, copied files from local
file system to HDFS and can view them by hadoop fs -ls.

  However, when I am trying to load the CSV file from pyspark shell
(launched by bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0)
from IMPETUS-1325 (172.26.49.55) with the following commands:


>>from pyspark.sql import SQLContext

>>sqlContext = SQLContext(sc)

>>patients_df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"false").load("hdfs://
172.26.49.156:54310/bibudh/healthcare/data/cloudera_challenge/patients.csv")


I get the following error:


java.io.EOFException: End of File Exception between local host is: "
IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55"; destination host is:
"IMPETUS-1466":54310; : java.io.EOFException; For more details see:
http://wiki.apache.org/hadoop/EOFException


I have changed the port number from 54310 to 8020, but then I get the error


java.net.ConnectException: Call From IMPETUS-1325.IMPETUS.CO.IN/172.26.49.55
to IMPETUS-1466:8020 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused


To me it seemed like this may result from a version mismatch between Spark
Hadoop client and my Hadoop cluster, so I have made the following changes:


1) Added the following lines to conf/spark-env.sh


export HADOOP_HOME="/usr/local/hadoop-1.0.4" export
HADOOP_CONF_DIR="$HADOOP_HOME/conf" export HDFS_URL="hdfs://
172.26.49.156:8020"


2) Downloaded Spark 1.6.0, pre-built with user-provided Hadoop, and in
addition to the three lines above, added the following line to
conf/spark-env.sh


export SPARK_DIST_CLASSPATH="/usr/local/hadoop-1.0.4/bin/hadoop"


but none of it seems to work. However, the following command works from
172.26.49.55 and gives the directory listing:

/usr/local/hadoop-1.0.4/bin/hadoop fs -ls hdfs://172.26.49.156:54310/


Any suggestion?


Thanks

Bibudh


-- 
Bibudh Lahiri
Data Scientist, Impetus Technolgoies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/


Re: Reading from Amazon S3

2016-04-26 Thread Ted Yu
Looking at the cause of the error, it seems hadoop-aws-xx.jar
(corresponding to the version of Hadoop you use) was missing from the classpath.

FYI

On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj 
wrote:

> Hi All,
> I am trying to read a file stored in Amazon S3.
> I wrote this code:
>
> import java.util.List;
>
> import java.util.Scanner;
>
> import org.apache.spark.SparkConf;
>
> import org.apache.spark.api.java.JavaRDD;
>
> import org.apache.spark.api.java.JavaSparkContext;
>
> import org.apache.spark.api.java.function.Function;
>
> import org.apache.spark.sql.DataFrame;
>
> import org.apache.spark.sql.Row;
>
> import org.apache.spark.sql.SQLContext;
>
> public class WordAnalysis {
>
> public static void main(String[] args) {
>
> int startYear=0;
>
> int endyear=0;
>
> Scanner input = new Scanner(System.in);
>
> System.out.println("Please, Enter 1 if you want 1599-2008 or enter 2
> for specific range: ");
>
> int choice=input.nextInt();
>
>
>
> if(choice==1)
>
> {
>
> startYear=1500;
>
> endyear=2008;
>
> }
>
> if(choice==2)
>
> {
>
> System.out.print("please,Enter the start year : ");
>
> startYear = input.nextInt();
>
> System.out.print("please,Enter the end year : ");
>
> endyear = input.nextInt();
>
> }
>
> SparkConf conf = new SparkConf().setAppName("jinantry").setMaster("local"
> );
>
> JavaSparkContext spark = new JavaSparkContext(conf);
>
> SQLContext sqlContext = new org.apache.spark.sql.SQLContext(spark);
>
> JavaRDD ngram = spark.textFile(
> "s3n://google-books-ngram/1gram/googlebooks-eng-all-1gram-20120701-x.gz‏")
>
> .map(new Function() {
>
> public Items call(String line) throws Exception {
>
> String[] parts = line.split("\t");
>
> Items item = new Items();
>
> if (parts.length == 4) {
>
> item.setWord(parts[0]);
>
> item.setYear(Integer.parseInt(parts[1]));
>
> item.setCount(Integer.parseInt(parts[2]));
>
> item.setVolume(Integer.parseInt(parts[3]));
>
> return item;
>
> } else {
>
> item.setWord(" ");
>
> item.setYear(Integer.parseInt(" "));
>
> item.setCount(Integer.parseInt(" "));
>
> item.setVolume(Integer.parseInt(" "));
>
> return item;
>
> }
>
> }
>
> });
>
> DataFrame schemangram = sqlContext.createDataFrame(ngram, Items.class);
>
> schemangram.registerTempTable("ngram");
>
> String sql="SELECT word,SUM(count) FROM ngram where year >="+startYear+"
> AND year<="+endyear+" And word LIKE '%_NOUN' GROUP BY word ORDER BY
> SUM(count) DESC";
>
> DataFrame matchyear = sqlContext.sql(sql);
>
> List words=matchyear.collectAsList();
>
> int i=1;
>
> for (Row scholar : words) {
>
> System.out.println(scholar);
>
> if(i==10)
>
> break;
>
> i++;
>
>   }
>
>
> }
>
>
> }
>
>
> When I run it this error appear to me:
>
> Exception in thread "main"
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
> tree:
>
> Exchange rangepartitioning(aggOrder#5L DESC,200), None
>
> +- ConvertToSafe
>
>+- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as
> bigint)),mode=Final,isDistinct=false)], output=[word#2,_c1#4L,aggOrder#5L])
>
>   +- TungstenExchange hashpartitioning(word#2,200), None
>
>  +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0
> as bigint)),mode=Partial,isDistinct=false)], output=[word#2,sum#8L])
>
> +- Project [word#2,count#0]
>
>+- Filter (((year#3 >= 1500) && (year#3 <= 1600)) && word#2
> LIKE %_NOUN)
>
>   +- Scan ExistingRDD[count#0,volume#1,word#2,year#3]
>
>
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>
> at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at
> 

Re: Fill Gaps between rows

2016-04-26 Thread Andrés Ivaldi
Thanks Sebastian, I've never understood that part of HiveContext. So it's
possible to use HiveContext, then use the window functions, and save the
DataFrame into another source like MSSQL, Oracle, or anything else with JDBC?

Regards.

On Tue, Apr 26, 2016 at 1:22 PM, Sebastian Piu 
wrote:

> Yes you need hive Context for the window functions, but you don't need
> hive for it to work
>
> On Tue, 26 Apr 2016, 14:15 Andrés Ivaldi,  wrote:
>
>> Hello, do exists an Out Of the box for fill in gaps between rows with a
>> given  condition?
>> As example: I have a source table with data and a column with the day
>> number, but the record only register a event and no necessary all days have
>> events, so the table no necessary has all days. But I want a resultant
>> Table with all days, filled in the data with 0 o same as row before.
>>
>> I'm using SQLContext. I think window function will do that, but I cant
>> use it without hive context, Is that right?
>>
>> regards
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres


Re: Fill Gaps between rows

2016-04-26 Thread Sebastian Piu
Yes, you need HiveContext for the window functions, but you don't need Hive
itself for it to work.
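
A minimal sketch of the combination, HiveContext plus a window function plus a
JDBC write; the actual gap-filling logic is left out and all names are made up:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val hc = new HiveContext(sc)   // needed for window functions, no Hive install required
import hc.implicits._

val events = Seq((1, 10), (3, 30), (4, 40)).toDF("day", "value")   // made-up data
val w = Window.orderBy("day")
val withPrev = events.withColumn("prev_value", lag($"value", 1).over(w))

// Writing the result out over JDBC (URL and table are placeholders):
// withPrev.write.jdbc("jdbc:sqlserver://host;databaseName=db",
//   "filled_table", new java.util.Properties)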

On Tue, 26 Apr 2016, 14:15 Andrés Ivaldi,  wrote:

> Hello, do exists an Out Of the box for fill in gaps between rows with a
> given  condition?
> As example: I have a source table with data and a column with the day
> number, but the record only register a event and no necessary all days have
> events, so the table no necessary has all days. But I want a resultant
> Table with all days, filled in the data with 0 o same as row before.
>
> I'm using SQLContext. I think window function will do that, but I cant use
> it without hive context, Is that right?
>
> regards
>
>
> --
> Ing. Ivaldi Andres
>


Reading from Amazon S3

2016-04-26 Thread Jinan Alhajjaj
Hi All,
I am trying to read a file stored in Amazon S3. I wrote this code:

import java.util.List;
import java.util.Scanner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class WordAnalysis {
    public static void main(String[] args) {
        int startYear = 0;
        int endyear = 0;
        Scanner input = new Scanner(System.in);
        System.out.println("Please, Enter 1 if you want 1599-2008 or enter 2 for specific range: ");
        int choice = input.nextInt();

        if (choice == 1) {
            startYear = 1500;
            endyear = 2008;
        }
        if (choice == 2) {
            System.out.print("please,Enter the start year : ");
            startYear = input.nextInt();
            System.out.print("please,Enter the end year : ");
            endyear = input.nextInt();
        }
        SparkConf conf = new SparkConf().setAppName("jinantry").setMaster("local");
        JavaSparkContext spark = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(spark);
        JavaRDD<Items> ngram = spark.textFile("s3n://google-books-ngram/1gram/googlebooks-eng-all-1gram-20120701-x.gz")
                .map(new Function<String, Items>() {
                    public Items call(String line) throws Exception {
                        String[] parts = line.split("\t");
                        Items item = new Items();
                        if (parts.length == 4) {
                            item.setWord(parts[0]);
                            item.setYear(Integer.parseInt(parts[1]));
                            item.setCount(Integer.parseInt(parts[2]));
                            item.setVolume(Integer.parseInt(parts[3]));
                            return item;
                        } else {
                            item.setWord(" ");
                            item.setYear(Integer.parseInt(" "));
                            item.setCount(Integer.parseInt(" "));
                            item.setVolume(Integer.parseInt(" "));
                            return item;
                        }
                    }
                });
        DataFrame schemangram = sqlContext.createDataFrame(ngram, Items.class);
        schemangram.registerTempTable("ngram");
        String sql = "SELECT word,SUM(count) FROM ngram where year >=" + startYear + " AND year<=" + endyear + " And word LIKE '%_NOUN' GROUP BY word ORDER BY SUM(count) DESC";
        DataFrame matchyear = sqlContext.sql(sql);
        List<Row> words = matchyear.collectAsList();
        int i = 1;
        for (Row scholar : words) {
            System.out.println(scholar);
            if (i == 10)
                break;
            i++;
        }
    }
}

When I run it, this error appears:

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(aggOrder#5L DESC,200), None
+- ConvertToSafe
   +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as 
bigint)),mode=Final,isDistinct=false)], output=[word#2,_c1#4L,aggOrder#5L])
  +- TungstenExchange hashpartitioning(word#2,200), None
 +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as 
bigint)),mode=Partial,isDistinct=false)], output=[word#2,sum#8L])
+- Project [word#2,count#0]
   +- Filter (((year#3 >= 1500) && (year#3 <= 1600)) && word#2 LIKE 
%_NOUN)
  +- Scan ExistingRDD[count#0,volume#1,word#2,year#3] 


at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at 

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: not a mapper but a task for HadoopRDD, maybe, as far as I know.

I think the clearest way is just to run a job on multiple files with the API and
check the number of tasks in the job.
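
For example, in the shell (bucket/path and the partition hint are placeholders):

val rdd = sc.wholeTextFiles("s3n://some-bucket/some-dir/", minPartitions = 8)
println(rdd.partitions.length)   // partitions == tasks for the read stage
rdd.count()                      // run a job and compare with the task count in the UI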
On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon"  wrote:

wholeTextFile() API uses WholeTextFileInputFormat,
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
which returns false for isSplittable. In this case, only single mapper
appears for the entire file as far as I know.

And also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the file is single file, then this would not be distributed.
On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/SparkContext.scala
>
>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>*
>*  then `rdd` contains
>* {{{
>*   (a-hdfs-path/part-0, its content)
>*   (a-hdfs-path/part-1, its content)
>*   ...
>*   (a-hdfs-path/part-n, its content)
>* }}}
> ...
>   * @param minPartitions A suggestion value of the minimal splitting
> number for input data.
>
>   def wholeTextFiles(
>   path: String,
>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
> withScope {
>
> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
> wrote:
>
>> Hi guys,
>>
>> I'm trying to read many filed from s3 using
>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>> manner? Please give me a link to the place in documentation where it's
>> specified.
>>
>> Thanks, Vadim.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
wholeTextFile() API uses WholeTextFileInputFormat,
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
which returns false for isSplittable. In this case, only single mapper
appears for the entire file as far as I know.

And also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the file is single file, then this would not be distributed.
On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/SparkContext.scala
>
>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>*
>*  then `rdd` contains
>* {{{
>*   (a-hdfs-path/part-0, its content)
>*   (a-hdfs-path/part-1, its content)
>*   ...
>*   (a-hdfs-path/part-n, its content)
>* }}}
> ...
>   * @param minPartitions A suggestion value of the minimal splitting
> number for input data.
>
>   def wholeTextFiles(
>   path: String,
>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
> withScope {
>
> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
> wrote:
>
>> Hi guys,
>>
>> I'm trying to read many filed from s3 using
>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>> manner? Please give me a link to the place in documentation where it's
>> specified.
>>
>> Thanks, Vadim.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark SQL query for List

2016-04-26 Thread Ramkumar V
I'm getting the following exception if I form a query like this. It's not coming
to the point where get(0) or get(1) would be called.

Exception in thread "main" java.lang.RuntimeException: [1.22] failure:
``*'' expected but `cities' found


*Thanks*,



On Tue, Apr 26, 2016 at 4:41 PM, Hyukjin Kwon  wrote:

> Doesn't get(0) give you the Array[String] for CITY (am I missing
> something?)
> On 26 Apr 2016 11:02 p.m., "Ramkumar V"  wrote:
>
> JavaSparkContext ctx = new JavaSparkContext(sparkConf);
>
> SQLContext sqlContext = new SQLContext(ctx);
>
> DataFrame parquetFile = sqlContext.parquetFile(
> "hdfs:/XYZ:8020/user/hdfs/parquet/*.parquet");
>
>parquetFile.registerTempTable("parquetFile");
>
> DataFrame tempDF = sqlContext.sql("SELECT TOUR.CITIES, BUDJET from
> parquetFile");
>
> JavaRDDjRDD = tempDF.toJavaRDD();
>
>  JavaRDD ones = jRDD.map(new Function() {
>
>   public String call(Row row) throws Exception {
>
> return row.getString(1);
>
>   }
>
> });
>
> *Thanks*,
> 
>
>
> On Tue, Apr 26, 2016 at 3:48 PM, Hyukjin Kwon  wrote:
>
>> Could you maybe share your codes?
>> On 26 Apr 2016 9:51 p.m., "Ramkumar V"  wrote:
>>
>>> Hi,
>>>
>>> I had loaded JSON file in parquet format into SparkSQL. I can't able to
>>> read List which is inside JSON.
>>>
>>> Sample JSON
>>>
>>> {
>>> "TOUR" : {
>>>  "CITIES" : ["Paris","Berlin","Prague"]
>>> },
>>> "BUDJET" : 100
>>> }
>>>
>>> I want to read value of CITIES.
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu
Spark can create distributed datasets from any storage source supported 
by Hadoop, including your local file system, HDFS, Cassandra, HBase, 
Amazon S3, etc. Spark supports text files, SequenceFiles, and any other 
Hadoop InputFormat.

Text file RDDs can be created using SparkContext's textFile method. 
This method takes a URI for the file (either a local path on the 
machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection 
of lines. Here is an example invocation.

I could not find a concrete statement saying whether reading more than 
one file is distributed or not.


On 26.04.2016 18:00, Hyukjin Kwon wrote:

then this would not be distributed




Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the file is single file, then this would not be distributed.
On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/SparkContext.scala
>
>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>*
>*  then `rdd` contains
>* {{{
>*   (a-hdfs-path/part-0, its content)
>*   (a-hdfs-path/part-1, its content)
>*   ...
>*   (a-hdfs-path/part-n, its content)
>* }}}
> ...
>   * @param minPartitions A suggestion value of the minimal splitting
> number for input data.
>
>   def wholeTextFiles(
>   path: String,
>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
> withScope {
>
> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
> wrote:
>
>> Hi guys,
>>
>> I'm trying to read many filed from s3 using
>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>> manner? Please give me a link to the place in documentation where it's
>> specified.
>>
>> Thanks, Vadim.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Ted Yu
Please take a look at:
core/src/main/scala/org/apache/spark/SparkContext.scala

   * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
   *
   *  then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-0, its content)
   *   (a-hdfs-path/part-1, its content)
   *   ...
   *   (a-hdfs-path/part-n, its content)
   * }}}
...
  * @param minPartitions A suggestion value of the minimal splitting number
for input data.

  def wholeTextFiles(
  path: String,
  minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
withScope {

On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
wrote:

> Hi guys,
>
> I'm trying to read many filed from s3 using
> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
> manner? Please give me a link to the place in documentation where it's
> specified.
>
> Thanks, Vadim.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu

Hi guys,

I'm trying to read many files from S3 using 
JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed 
manner? Please give me a link to the place in the documentation where it's 
specified.


Thanks, Vadim.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka exception in Apache Spark

2016-04-26 Thread Cody Koeninger
That error indicates a message bigger than the buffer's capacity

https://issues.apache.org/jira/browse/KAFKA-1196
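
For the direct stream, the fetch size can be raised through the kafkaParams map,
something like this sketch (broker, topic, and size are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "fetch.message.max.bytes" -> (10 * 1024 * 1024).toString  // must exceed the largest message
)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))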


On Tue, Apr 26, 2016 at 3:07 AM, Michel Hubert  wrote:
> Hi,
>
>
>
>
>
> I use a Kafka direct stream approach.
>
> My Spark application was running ok.
>
> This morning we upgraded to CDH 5.7.0
>
> And when I re-started my Spark application I get exceptions.
>
>
>
> It seems a problem with the direct stream approach.
>
> Any ideas how to fix this?
>
>
>
>
>
>
>
> User class threw exception: org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure:
> Lost task 3.3 in stage 0.0 (TID 26, bfravicsvr81439-cld.opentsp.com):
> java.lang.IllegalArgumentException
>
> at java.nio.Buffer.limit(Buffer.java:267)
>
> at kafka.api.FetchResponsePartitionData$.readFrom(FetchResponse.scala:38)
>
> at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:100)
>
> at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:98)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at scala.collection.immutable.Range.foreach(Range.scala:141)
>
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>
> at kafka.api.TopicData$.readFrom(FetchResponse.scala:98)
>
> at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
>
> at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
>
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>
> at scala.collection.immutable.Range.foreach(Range.scala:141)
>
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>
> at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
>
> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
>
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:192)
>
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
>
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Doesn't get(0) give you the Array[String] for CITY (am I missing something?)
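If the SQL parser is what trips over the nested field, selecting it through the
DataFrame API might also work (rough Scala sketch, untested):

val df = sqlContext.read.parquet("hdfs:/XYZ:8020/user/hdfs/parquet/*.parquet")
val cities = df.select("TOUR.CITIES", "BUDJET")
cities.rdd
  .map(row => row.getSeq[String](0))   // CITIES comes back as a Seq[String]
  .collect()
  .foreach(println)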
On 26 Apr 2016 11:02 p.m., "Ramkumar V"  wrote:

JavaSparkContext ctx = new JavaSparkContext(sparkConf);

SQLContext sqlContext = new SQLContext(ctx);

DataFrame parquetFile = sqlContext.parquetFile(
"hdfs:/XYZ:8020/user/hdfs/parquet/*.parquet");

   parquetFile.registerTempTable("parquetFile");

DataFrame tempDF = sqlContext.sql("SELECT TOUR.CITIES, BUDJET from
parquetFile");

JavaRDDjRDD = tempDF.toJavaRDD();

 JavaRDD ones = jRDD.map(new Function() {

  public String call(Row row) throws Exception {

return row.getString(1);

  }

});

*Thanks*,



On Tue, Apr 26, 2016 at 3:48 PM, Hyukjin Kwon  wrote:

> Could you maybe share your codes?
> On 26 Apr 2016 9:51 p.m., "Ramkumar V"  wrote:
>
>> Hi,
>>
>> I had loaded JSON file in parquet format into SparkSQL. I can't able to
>> read List which is inside JSON.
>>
>> Sample JSON
>>
>> {
>> "TOUR" : {
>>  "CITIES" : ["Paris","Berlin","Prague"]
>> },
>> "BUDJET" : 100
>> }
>>
>> I want to read value of CITIES.
>>
>> *Thanks*,
>> 
>>
>>
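
For reference, a minimal Scala sketch of the point above (the thread's code is Java; the "cities" alias and the variable names are assumptions): the selected TOUR.CITIES column is an array<string>, so it has to be read with getSeq/getList rather than getString, or flattened with explode().

import org.apache.spark.sql.functions._

val tempDF = sqlContext.sql("SELECT TOUR.CITIES AS cities, BUDJET FROM parquetFile")

// per-row access: column 0 holds the array of city names
val cityLists = tempDF.rdd.map(row => row.getSeq[String](0))

// or flatten to one city per row at the DataFrame level
val cities = tempDF.select(explode(col("cities")).as("city"))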


Streaming K-means not printing predictions

2016-04-26 Thread Ashutosh Kumar
I created a streaming k-means job based on the Scala example. It keeps running
without any errors, but it never prints any predictions.

Here is Log

19:15:05,050 INFO
org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
batch metadata: 146167824 ms
19:15:10,001 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
files took 1 ms
19:15:10,001 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - New files
at time 146167831 ms:

19:15:10,007 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
files took 2 ms
19:15:10,007 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - New files
at time 146167831 ms:

19:15:10,014 INFO
org.apache.spark.streaming.scheduler.JobScheduler - Added jobs
for time 146167831 ms
19:15:10,015 INFO
org.apache.spark.streaming.scheduler.JobScheduler - Starting
job streaming job 146167831 ms.0 from job set of time 146167831 ms
19:15:10,028 INFO
org.apache.spark.SparkContext - Starting
job: collect at StreamingKMeans.scala:89
19:15:10,028 INFO
org.apache.spark.scheduler.DAGScheduler   - Job 292
finished: collect at StreamingKMeans.scala:89, took 0.41 s
19:15:10,029 INFO
org.apache.spark.streaming.scheduler.JobScheduler - Finished
job streaming job 146167831 ms.0 from job set of time 146167831 ms
19:15:10,029 INFO
org.apache.spark.streaming.scheduler.JobScheduler - Starting
job streaming job 146167831 ms.1 from job set of time 146167831 ms
---
Time: 146167831 ms
---

19:15:10,036 INFO
org.apache.spark.streaming.scheduler.JobScheduler - Finished
job streaming job 146167831 ms.1 from job set of time 146167831 ms
19:15:10,036 INFO
org.apache.spark.rdd.MapPartitionsRDD - Removing
RDD 2912 from persistence list
19:15:10,037 INFO
org.apache.spark.rdd.MapPartitionsRDD - Removing
RDD 2911 from persistence list
19:15:10,037 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2912
19:15:10,037 INFO
org.apache.spark.streaming.scheduler.JobScheduler - Total
delay: 0.036 s for time 146167831 ms (execution: 0.021 s)
19:15:10,037 INFO
org.apache.spark.rdd.UnionRDD - Removing
RDD 2800 from persistence list
19:15:10,037 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2911
19:15:10,037 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
old files that were older than 146167825 ms: 1461678245000 ms
19:15:10,037 INFO
org.apache.spark.rdd.MapPartitionsRDD - Removing
RDD 2917 from persistence list
19:15:10,037 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2800
19:15:10,037 INFO
org.apache.spark.rdd.MapPartitionsRDD - Removing
RDD 2916 from persistence list
19:15:10,037 INFO
org.apache.spark.rdd.MapPartitionsRDD - Removing
RDD 2915 from persistence list
19:15:10,037 INFO
org.apache.spark.rdd.MapPartitionsRDD - Removing
RDD 2914 from persistence list
19:15:10,037 INFO
org.apache.spark.rdd.UnionRDD - Removing
RDD 2803 from persistence list
19:15:10,037 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - Cleared 1
old files that were older than 146167825 ms: 1461678245000 ms
19:15:10,038 INFO
org.apache.spark.streaming.scheduler.ReceivedBlockTracker - Deleting
batches ArrayBuffer()
19:15:10,038 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2917
19:15:10,038 INFO
org.apache.spark.streaming.scheduler.InputInfoTracker - remove old
batch metadata: 1461678245000 ms
19:15:10,038 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2914
19:15:10,038 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2916
19:15:10,038 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2915
19:15:10,038 INFO
org.apache.spark.storage.BlockManager - Removing
RDD 2803
19:15:15,001 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - Finding new
files took 1 ms
19:15:15,001 INFO
org.apache.spark.streaming.dstream.FileInputDStream   - New files
at time 1461678315000 ms:
.


StreamingKmeans.java
Description: Binary data
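
For reference, a minimal Scala sketch of the standard streaming k-means wiring from the MLlib example (not the attached code; the paths, k and feature dimension are assumptions). Note that textFileStream only picks up files that land in the directory after the stream starts, so predictions are printed only once new test files arrive.

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))      // sc: an existing SparkContext

val trainingData = ssc.textFileStream("/data/train").map(Vectors.parse)
val testData = ssc.textFileStream("/data/test").map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(3)                      // number of clusters (assumption)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)     // feature dimension = 2, initial weight = 0.0

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()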


Re: Spark SQL query for List

2016-04-26 Thread Ramkumar V
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

SQLContext sqlContext = new SQLContext(ctx);

DataFrame parquetFile = sqlContext.parquetFile(
"hdfs:/XYZ:8020/user/hdfs/parquet/*.parquet");

   parquetFile.registerTempTable("parquetFile");

DataFrame tempDF = sqlContext.sql("SELECT TOUR.CITIES, BUDJET from
parquetFile");

JavaRDD<Row> jRDD = tempDF.toJavaRDD();

 JavaRDD<String> ones = jRDD.map(new Function<Row, String>() {

  public String call(Row row) throws Exception {

return row.getString(1);

  }

});

*Thanks*,



On Tue, Apr 26, 2016 at 3:48 PM, Hyukjin Kwon  wrote:

> Could you maybe share your codes?
> On 26 Apr 2016 9:51 p.m., "Ramkumar V"  wrote:
>
>> Hi,
>>
>> I had loaded JSON file in parquet format into SparkSQL. I can't able to
>> read List which is inside JSON.
>>
>> Sample JSON
>>
>> {
>> "TOUR" : {
>>  "CITIES" : ["Paris","Berlin","Prague"]
>> },
>> "BUDJET" : 100
>> }
>>
>> I want to read value of CITIES.
>>
>> *Thanks*,
>> 
>>
>>


Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Could you maybe share your code?
On 26 Apr 2016 9:51 p.m., "Ramkumar V"  wrote:

> Hi,
>
> I had loaded JSON file in parquet format into SparkSQL. I can't able to
> read List which is inside JSON.
>
> Sample JSON
>
> {
> "TOUR" : {
>  "CITIES" : ["Paris","Berlin","Prague"]
> },
> "BUDJET" : 100
> }
>
> I want to read value of CITIES.
>
> *Thanks*,
> 
>
>


Fill Gaps between rows

2016-04-26 Thread Andrés Ivaldi
Hello, is there an out-of-the-box way to fill in gaps between rows given a
condition?

For example: I have a source table with a column holding the day number, but a
row is only written when an event occurs, so not every day appears in the
table. I want a resulting table with all days, filling the missing ones with 0
(or with the value of the previous row).

I'm using SQLContext. I think a window function would do this, but I can't use
one without a HiveContext. Is that right?

regards

-- 
Ing. Ivaldi Andres
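
One join-based sketch that works with a plain SQLContext, i.e. without HiveContext window functions (the table name "events" and the columns "day" and "value" are assumptions): build the full calendar of days, left-join the sparse event table onto it, and replace the resulting nulls with 0.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val events = sqlContext.table("events")            // columns: day (Int), value (Double)
val Row(minDay: Int, maxDay: Int) = events.agg(min("day"), max("day")).first()

// one row per day in the full range, then attach the existing events to it
val allDays = (minDay to maxDay).map(Tuple1(_)).toDF("day")
val filled = allDays
  .join(events, Seq("day"), "left_outer")
  .na.fill(0.0, Seq("value"))                      // 0 for days with no event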


how to make task is assigned after all executors are launched

2016-04-26 Thread Qian Huang
Hi all,

Since the data I want to process is not on HDFS, I use sc.makeRDD() to ensure that
all items of a partition are located on one node, so the task can be launched on
that node.

The problem is that tasks are sometimes assigned to the executors that are already
up before the remaining executors have been launched. As a result, a partition
meant for node A may be processed by an executor on node B, which is not what I
expect.

How can I make sure tasks are assigned only after all executors are launched? If
anyone has an idea, please let me know. Thanks so much!



Best Regards,
Qian
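
For reference, a hedged Scala sketch of the two relevant pieces (host names and values are assumptions, not a definitive fix): makeRDD can attach preferred locations to each partition, and the spark.scheduler.minRegisteredResourcesRatio / spark.scheduler.maxRegisteredResourcesWaitingTime settings make the scheduler hold back the first job until enough executors have registered.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-example")
  // hold the first job until 100% of requested executors register (or the wait times out)
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
val sc = new SparkContext(conf)

// each element of the outer Seq becomes one partition, pinned to its preferred hosts
val rdd = sc.makeRDD(Seq(
  (Seq("item-a1", "item-a2"), Seq("nodeA")),
  (Seq("item-b1", "item-b2"), Seq("nodeB"))
))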



Spark SQL query for List

2016-04-26 Thread Ramkumar V
Hi,

I have loaded a JSON file in Parquet format into Spark SQL. I can't read a List
that is nested inside the JSON.

Sample JSON

{
"TOUR" : {
 "CITIES" : ["Paris","Berlin","Prague"]
},
"BUDJET" : 100
}

I want to read value of CITIES.

*Thanks*,



Save RDD to HDFS using Spark Python API

2016-04-26 Thread Luke Adolph
Hi, all:
Below is my code:

from pyspark import *
import re

def getDateByLine(input_str):
    str_pattern = '^\d{4}-\d{2}-\d{2}'
    pattern = re.compile(str_pattern)
    match = pattern.match(input_str)
    if match:
        return match.group()
    else:
        return None

file_url = "hdfs://192.168.10.130:9000/dev/test/test.log"
input_file = sc.textFile(file_url)
line = input_file.filter(getDateByLine).map(lambda x: (x[:10], 1))
counts = line.reduceByKey(lambda a, b: a + b)
print counts.collect()
counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",
                        "org.apache.hadoop.mapred.SequenceFileOutputFormat")


What confuses me is the *saveAsHadoopFile* method. I have read the PySpark
API docs, but I still don't understand what the second argument means.

Below is the output when I run the above code:

[(u'2016-02-29', 99), (u'2016-03-02', 30)]

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
 in ()
     18 counts = line.reduceByKey(lambda a,b: a+b)
     19 print counts.collect()
---> 20 counts.saveAsHadoopFile("hdfs://192.168.10.130:9000/dev/output/test",
             "org.apache.hadoop.mapred.SequenceFileOutputFormat")

/mydata/softwares/spark-1.6.1/python/pyspark/rdd.pyc in saveAsHadoopFile(self, path, outputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, compressionCodecClass)
   1419                                 keyClass, valueClass,
   1420                                 keyConverter, valueConverter,
-> 1421                                 jconf, compressionCodecClass)
   1422
   1423     def saveAsSequenceFile(self, path, compressionCodecClass=None):

/mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

/mydata/softwares/spark-1.6.1/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/mydata/softwares/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://192.168.10.130:9000/dev/output/test already exists
at 
org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1179)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026)
at 
org.apache.spark.api.python.PythonRDD$.saveAsHadoopFile(PythonRDD.scala:753)
at 
org.apache.spark.api.python.PythonRDD.saveAsHadoopFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at 

Re: Splitting spark dstream into separate fields

2016-04-26 Thread Mich Talebzadeh
many thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 26 April 2016 at 12:23, Praveen Devarao <praveen...@in.ibm.com> wrote:

> Given that you are not specifying any key explicitly [usually people will
> user the Producer API and have a key value pair inserted] the key will be
> nullso your tuples would look like below
>
> (null, "ID   TIMESTAMP   PRICE")
> (null, "40,20160426-080924,  67.55738301621814598514")
>
> For values...the positions should be 0 indexedhence (referring to your
> invocations) words1 will return value for TIMESTAMP and words2 will return
> value for PRICE
>
> >>I assume this is an array that can be handled as elements of an array as
> well?<<
>
> These are all still under your DStream...you will need to invoke action on
> the DStream to use themfor instance words.foreachRDD(.)
>
> It should be easy for you to just run the streaming program and call print
> on each resulting DStream to understand what data is contained in it and
> decide how to make use of it.
>
> Thanking You
>
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:Mich Talebzadeh <mich.talebza...@gmail.com>
> To:Praveen Devarao/India/IBM@IBMIN, "user @spark" <
> user@spark.apache.org>
> Date:26/04/2016 04:03 pm
> Subject:Re: Splitting spark dstream into separate fields
>
> --
>
>
>
> Thanks Praveen.
>
> With regard to key/value pair. My kafka takes the following rows as input
>
> cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list
> rhes564:9092 --topic newtopic
>
> That ${IN_FILE} is the source of prices (1000  as follows
>
> ID   TIMESTAMP   PRICE
> 40, 20160426-080924,  67.55738301621814598514
>
> So tuples would be like below?
>
>  (1,"ID")
> (2, "TIMESTAMP")
> (3, "PRICE")
>
> For values
> val words1 = lines.map(_.split(',').view(1))
> val words2 = lines.map(_.split(',').view(2))
> val words3 = lines.map(_.split(',').view(3))
>
> So word1 will return value of ID, word2 will return value of TIMESTAMP and
> word3 will return value of PRICE?
>
> I assume this is an array that can be handled as elements of an array as
> well?
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw*
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>
> *http://talebzadehmich.wordpress.com*
> <http://talebzadehmich.wordpress.com/>
>
>
>
> On 26 April 2016 at 11:11, Praveen Devarao <*praveen...@in.ibm.com*
> <praveen...@in.ibm.com>> wrote:
> Hi Mich,
>
> >>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct
> understanding of it
> <<
>
> Not sure what you refer to when said record into components. The
> above function is basically giving you the tuple (key/value pair) that you
> would have inserted into Kafka. say my Kafka producer puts data as
>
> 1=>"abc"
> 2 => "def"
>
> Then the above map would give you tuples as below
>
> (1,"abc")
> (2,"abc")
>
> >>
> The following splits the line into comma separated fields.
>
> val words = lines.map(_.split(',').view(2))
> <<
> Right, basically the value portion of your kafka data is being
> handled here
>
> >>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three So view(2) returns the
> value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0),
> (line(1),line(2) ...
> <<
>
> The split operation is returnin

Re: Splitting spark dstream into separate fields

2016-04-26 Thread Praveen Devarao
Given that you are not specifying any key explicitly [usually people will 
use the Producer API and have a key/value pair inserted], the key will be 
null...so your tuples would look like below

(null, "ID   TIMESTAMP   PRICE")
(null, "40,20160426-080924,  67.55738301621814598514")

For values...the positions should be 0 indexed...hence (referring to your 
invocations) words1 will return value for TIMESTAMP and words2 will return 
value for PRICE

>>I assume this is an array that can be handled as elements of an array as 
well?<<

These are all still under your DStream...you will need to invoke action on 
the DStream to use them...for instance words.foreachRDD(...)

It should be easy for you to just run the streaming program and call print 
on each resulting DStream to understand what data is contained in it and 
decide how to make use of it.

Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Mich Talebzadeh <mich.talebza...@gmail.com>
To: Praveen Devarao/India/IBM@IBMIN, "user @spark" 
<user@spark.apache.org>
Date:   26/04/2016 04:03 pm
Subject:Re: Splitting spark dstream into separate fields



Thanks Praveen.

With regard to key/value pair. My kafka takes the following rows as input

cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list 
rhes564:9092 --topic newtopic

That ${IN_FILE} is the source of prices (1000  as follows

ID   TIMESTAMP   PRICE
40, 20160426-080924,  67.55738301621814598514

So tuples would be like below?

 (1,"ID")
(2, "TIMESTAMP")
(3, "PRICE")

For values
val words1 = lines.map(_.split(',').view(1))
val words2 = lines.map(_.split(',').view(2))
val words3 = lines.map(_.split(',').view(3))

So word1 will return value of ID, word2 will return value of TIMESTAMP and 
word3 will return value of PRICE?

I assume this is an array that can be handled as elements of an array as 
well?

Regards

Dr Mich Talebzadeh
 
LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 
http://talebzadehmich.wordpress.com
 

On 26 April 2016 at 11:11, Praveen Devarao <praveen...@in.ibm.com> wrote:
Hi Mich,

>>
val lines = dstream.map(_._2)

This maps the record into components? Is that the correct 
understanding of it
<<

Not sure what you refer to when said record into components. The 
above function is basically giving you the tuple (key/value pair) that you 
would have inserted into Kafka. say my Kafka producer puts data as 

1=>"abc"
2 => "def"

Then the above map would give you tuples as below

(1,"abc")
(2,"abc")

>> 
The following splits the line into comma separated fields. 


val words = lines.map(_.split(',').view(2))
<<
Right, basically the value portion of your kafka data is being 
handled here

>>
val words = lines.map(_.split(',').view(2))

I am interested in column three So view(2) returns the 
value.

I have also seen other ways like

val words = lines.map(_.split(',').map(line => (line(0), 
(line(1),line(2) ...
<<

The split operation is returning back an array of String [a 
immutable StringLike collection]calling the view method is creating a 
IndexedSeqView on the iterable while as in the second way you are 
iterating through it accessing the elements directly via the index 
position [line(0), line(1) ]. You would have to decide what is best for 
your use case based on evaluations should be lazy or immediate [see 
references below].

References: 
http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
, 
http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView



Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:Mich Talebzadeh <mich.talebza...@gmail.com>
To:"user @spark" <use

Unsubscribe

2016-04-26 Thread Andrew Heinrichs
Unsubscribe
On Apr 22, 2016 3:21 PM, "Mich Talebzadeh" 
wrote:

>
> Hi,
>
> Anyone know which jar file has  import org.apache.spark.internal.Logging?
>
> I tried *spark-core_2.10-1.5.1.jar *
>
> but does not seem to work
>
> scala> import org.apache.spark.internal.Logging
>
> :57: error: object internal is not a member of package
> org.apache.spark
>  import org.apache.spark.internal.Logging
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Splitting spark dstream into separate fields

2016-04-26 Thread Mich Talebzadeh
Thanks Praveen.

With regard to key/value pair. My kafka takes the following rows as input

cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list
rhes564:9092 --topic newtopic

That ${IN_FILE} is the source of prices (1000  as follows

ID   TIMESTAMP   PRICE
40, 20160426-080924,  67.55738301621814598514

So tuples would be like below?

 (1,"ID")
(2, "TIMESTAMP")
(3, "PRICE")

For values
val words1 = lines.map(_.split(',').view(1))
val words2 = lines.map(_.split(',').view(2))
val words3 = lines.map(_.split(',').view(3))

So word1 will return value of ID, word2 will return value of TIMESTAMP and
word3 will return value of PRICE?

I assume this is an array that can be handled as elements of an array as
well?

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 26 April 2016 at 11:11, Praveen Devarao <praveen...@in.ibm.com> wrote:

> Hi Mich,
>
> >>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct
> understanding of it
> <<
>
> Not sure what you refer to when said record into components. The
> above function is basically giving you the tuple (key/value pair) that you
> would have inserted into Kafka. say my Kafka producer puts data as
>
> 1=>"abc"
> 2 => "def"
>
> Then the above map would give you tuples as below
>
> (1,"abc")
> (2,"abc")
>
> >>
> The following splits the line into comma separated fields.
>
> val words = lines.map(_.split(',').view(2))
> <<
> Right, basically the value portion of your kafka data is being
> handled here
>
> >>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three So view(2) returns the
> value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0),
> (line(1),line(2) ...
> <<
>
> The split operation is returning back an array of String [a
> immutable *StringLike *collection]calling the view method is creating
> a *IndexedSeqView *on the iterable while as in the second way you are
> iterating through it accessing the elements directly via the index position
> [line(0), line(1) ]. You would have to decide what is best for your use
> case based on evaluations should be lazy or immediate [see references
> below].
>
> References:
> *http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike*
> <http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike>*,
> *
> *http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView*
> <http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView>
>
>
> Thanking You
>
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:Mich Talebzadeh <mich.talebza...@gmail.com>
> To:"user @spark" <user@spark.apache.org>
> Date:26/04/2016 12:58 pm
> Subject:Splitting spark dstream into separate fields
> --
>
>
>
> Hi,
>
> Is there any optimum way of splitting a dstream into components?
>
> I am doing Spark streaming and this the dstream I get
>
> val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topics)
>
> Now that dstream consists of 10,00 price lines per second like below
>
> ID, TIMESTAMP, PRICE
> 31,20160426-080924,93.53608929178084896656
>
> The columns are separated by commas/
>
> Now couple of questions:
>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct understanding of
> it
>
> The following splits the line into comma separated fields.
>
> val words = lines.map(_.split(',').v

RE: Kafka exception in Apache Spark

2016-04-26 Thread Michel Hubert
This is production.

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, 26 April 2016 12:01
To: Michel Hubert
CC: user@spark.apache.org
Subject: Re: Kafka exception in Apache Spark

Hi Michael,

Is this production or test?


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 26 April 2016 at 09:07, Michel Hubert 
> wrote:
Hi,


I use a Kafka direct stream approach.
My Spark application was running ok.
This morning we upgraded to CDH 5.7.0
And when I re-started my Spark application I get exceptions.

It seems a problem with the direct stream approach.
Any ideas how to fix this?



User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost 
task 3.3 in stage 0.0 (TID 26, 
bfravicsvr81439-cld.opentsp.com): 
java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:267)
at kafka.api.FetchResponsePartitionData$.readFrom(FetchResponse.scala:38)
at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:100)
at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:98)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:98)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:192)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




Re: Splitting spark dstream into separate fields

2016-04-26 Thread Praveen Devarao
Hi Mich,

>> 
val lines = dstream.map(_._2)

This maps the record into components? Is that the correct 
understanding of it
<<

Not sure what you refer to when you said record into components. The 
above function is basically giving you the tuple (key/value pair) that you 
would have inserted into Kafka. say my Kafka producer puts data as 

1=>"abc"
2 => "def"

Then the above map would give you tuples as below

(1,"abc")
(2,"abc")

>> 
The following splits the line into comma separated fields. 


val words = lines.map(_.split(',').view(2))
<<
Right, basically the value portion of your kafka data is being 
handled here

>>
val words = lines.map(_.split(',').view(2))

I am interested in column three So view(2) returns the 
value.
 
I have also seen other ways like

val words = lines.map(_.split(',').map(line => (line(0), 
(line(1),line(2) ...
<<

The split operation returns an array of String [an 
immutable StringLike collection]...calling the view method creates an 
IndexedSeqView over the iterable, whereas in the second way you 
iterate through it, accessing the elements directly via their index 
position [line(0), line(1) ...]. You would have to decide what is best for 
your use case based on whether evaluation should be lazy or immediate [see 
references below].

References: 
http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
 
, 
http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView
 

 
 
Thanking You
-
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
-
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Mich Talebzadeh 
To: "user @spark" 
Date:   26/04/2016 12:58 pm
Subject:Splitting spark dstream into separate fields



Hi,

Is there any optimum way of splitting a dstream into components?

I am doing Spark streaming and this the dstream I get

val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, topics)

Now that dstream consists of 10,00 price lines per second like below

ID, TIMESTAMP, PRICE
31,20160426-080924,93.53608929178084896656

The columns are separated by commas/

Now couple of questions:

val lines = dstream.map(_._2)

This maps the record into components? Is that the correct understanding of 
it

The following splits the line into comma separated fields. 

val words = lines.map(_.split(',').view(2))

I am interested in column three So view(2) returns the value.

I have also seen other ways like

val words = lines.map(_.split(',').map(line => (line(0), (line(1),line(2) 
...

line(0), line(1) refer to the position of the fields?

Which one is the adopted one or the correct one?

Thanks


Dr Mich Talebzadeh
 
LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 
http://talebzadehmich.wordpress.com
 





Re: Kafka exception in Apache Spark

2016-04-26 Thread Mich Talebzadeh
Hi Michael,

Is this production or test?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 26 April 2016 at 09:07, Michel Hubert  wrote:

> Hi,
>
>
>
>
>
> I use a Kafka direct stream approach.
>
> My Spark application was running ok.
>
> This morning we upgraded to CDH 5.7.0
>
> And when I re-started my Spark application I get exceptions.
>
>
>
> It seems a problem with the direct stream approach.
>
> Any ideas how to fix this?
>
>
>
>
>
>
>
> User class threw exception: org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent
> failure: Lost task 3.3 in stage 0.0 (TID 26,
> bfravicsvr81439-cld.opentsp.com): java.lang.IllegalArgumentException
>
> at java.nio.Buffer.limit(Buffer.java:267)
>
> at kafka.api.FetchResponsePartitionData$.readFrom(FetchResponse.scala:38)
>
> at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:100)
>
> at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:98)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at scala.collection.immutable.Range.foreach(Range.scala:141)
>
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>
> at kafka.api.TopicData$.readFrom(FetchResponse.scala:98)
>
> at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
>
> at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
>
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>
> at scala.collection.immutable.Range.foreach(Range.scala:141)
>
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>
> at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
>
> at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
>
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:192)
>
> at
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
>
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>


Kafka exception in Apache Spark

2016-04-26 Thread Michel Hubert
Hi,


I use a Kafka direct stream approach.
My Spark application was running ok.
This morning we upgraded to CDH 5.7.0, and when I restarted my Spark
application I got exceptions.

It seems to be a problem with the direct stream approach.
Any ideas on how to fix this?



User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost 
task 3.3 in stage 0.0 (TID 26, bfravicsvr81439-cld.opentsp.com): 
java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:267)
at kafka.api.FetchResponsePartitionData$.readFrom(FetchResponse.scala:38)
at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:100)
at kafka.api.TopicData$$anonfun$1.apply(FetchResponse.scala:98)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at kafka.api.TopicData$.readFrom(FetchResponse.scala:98)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:170)
at kafka.api.FetchResponse$$anonfun$4.apply(FetchResponse.scala:169)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at kafka.api.FetchResponse$.readFrom(FetchResponse.scala:169)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:135)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:192)
at 
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



Splitting spark dstream into separate fields

2016-04-26 Thread Mich Talebzadeh
Hi,

Is there any optimum way of splitting a dstream into components?

I am doing Spark streaming and this the dstream I get

val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](ssc, kafkaParams, topics)


Now that dstream consists of 10,00 price lines per second like below

ID, TIMESTAMP, PRICE
31,20160426-080924,93.53608929178084896656

The columns are separated by commas.

Now couple of questions:

val lines = dstream.map(_._2)

This maps the record into its components? Is that the correct understanding of
it?

The following splits the line into comma separated fields.

val words = lines.map(_.split(',').view(2))

I am interested in column three, so view(2) returns the value.

I have also seen other ways like

val words = lines.map(_.split(',').map(line => (line(0), (line(1),line(2)
...

line(0), line(1) refer to the position of the fields?

Which one is the adopted one or the correct one?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com
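
For reference, a hedged Scala sketch of one common pattern (the case class and the trim/toDouble handling are assumptions, not the one "adopted" way): split each value once and keep the three fields together, instead of calling split three times.

case class Price(id: String, timestamp: String, price: Double)

val lines = dstream.map(_._2)                  // keep only the Kafka message value
val prices = lines.map { line =>
  val fields = line.split(',')                 // 0-indexed: id, timestamp, price
  Price(fields(0).trim, fields(1).trim, fields(2).trim.toDouble)
}

prices.foreachRDD { rdd =>
  // act on the parsed records, e.g. rdd.filter(_.price > 90).count()
}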


How can I bucketize / group a DataFrame from parquet files?

2016-04-26 Thread Brandon White
I am creating a DataFrame from Parquet files. The schema is based on the
Parquet files; I do not know it beforehand. What I want to do is group the
entire DF into buckets based on a column.

val df = sqlContext.read.parquet("/path/to/files")
val groupedBuckets: DataFrame[String, Array[Rows]] =
df.groupBy($"columnName")

I know this does not work because the DataFrame's groupBy is only used for
aggregate functions. I cannot convert my DataFrame to a DataSet because I
do not have a case class for the DataSet schema. The only thing I can do is
convert the df to an RDD[Rows] and try to deal with the types. This is ugly
and difficult.

Is there any better way? Can I convert a DataFrame to a DataSet without a
predefined case class?

Brandon
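
For reference, a hedged Scala sketch of the RDD route mentioned above, kept reasonably type-safe by keying on the grouping column ("columnName" stands for whatever column is grouped on); a DataFrame itself cannot hold per-key arrays of rows without an aggregate such as collect_list, which in 1.x needs a HiveContext.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.read.parquet("/path/to/files")

// RDD[(key, all rows for that key)] -- no case class or predefined schema needed
val buckets: RDD[(String, Iterable[Row])] =
  df.rdd
    .keyBy(_.getAs[String]("columnName"))
    .groupByKey()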