Re: Returning DataFrame for text file

2017-04-06 Thread Yan Facai
SparkSession.read returns a DataFrameReader.
DataFrameReader supports a series of formats, such as csv, json, and text, as
you mentioned.

Check the API docs for more details:
+ http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
+ http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
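
As a quick illustration, a minimal sketch (spark-shell with Spark 2.x assumed;
paths are hypothetical) of the reader returning a DataFrame for each format:

// `spark` is the SparkSession provided by spark-shell; spark.read is a DataFrameReader
val textDF = spark.read.text("/path/to/file.txt")                         // single "value" string column, one row per line
val csvDF  = spark.read.option("header", "true").csv("/path/to/file.csv")
val jsonDF = spark.read.json("/path/to/file.json")

textDF.printSchema()   // root
                       //  |-- value: string (nullable = true)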




On Thu, Mar 30, 2017 at 2:58 AM, George Obama  wrote:

> Hi,
>
> I saw that in the API, either R or Scala, we are returning a DataFrame for
> sparkSession.read.text()
>
> What's the rationale behind this?
>
> Regards,
> George
>


Re: Master-Worker communication on Standalone cluster issues

2017-04-06 Thread Yan Facai
1. For worker and master:
spark.worker.timeout 60s
see: http://spark.apache.org/docs/latest/spark-standalone.html

2. For executor and driver:
spark.executor.heartbeatInterval 10s
see: http://spark.apache.org/docs/latest/configuration.html


Please correct me if I'm wrong.
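
As a sketch of where these would typically be set (treat the exact placement as
an assumption and double-check against the pages above):

# conf/spark-env.sh on the standalone master: how long the master waits for
# worker heartbeats before it marks a worker as lost (seconds, default 60)
SPARK_MASTER_OPTS="-Dspark.worker.timeout=60"

# per-application executor -> driver heartbeat, e.g. passed at submit time:
#   spark-submit --conf spark.executor.heartbeatInterval=10s ...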



On Thu, Apr 6, 2017 at 5:01 AM, map reduced  wrote:

> Hi,
>
> I was wondering how often the Worker pings the Master to check on the Master's
> liveness. Or is it the Master (resource manager) that pings the Workers to
> check on their liveness and, if any workers are dead, to spawn new ones? Or is
> it both?
>
> Some info:
> Standalone cluster
> 1 Master - 8core 12Gb
> 32 workers - each 8 core and 8 Gb
>
> My main problem - Here's what happened:
>
> Master M - running with 32 workers
> Worker 1 and 2 died at 03:55:00 - so now the cluster is 30 workers
>
> Worker 1' came up at 03:55:12.000 AM - it connected to M
> Worker 2' came up at 03:55:16.000 AM - it connected to M
>
> Master M *dies* at 03:56:00 AM
> New master NM' comes up at 03:56:30 AM
> Worker 1' and 2' - *DO NOT* connect to NM
> Remaining 30 workers connect to NM.
>
> So NM now has 30 workers.
>
> I was wondering why those two won't connect to the new master NM even
> though master M is dead for sure.
>
> PS: I have an LB setup for the Master, which means that whenever a new master
> comes in, the LB will start pointing to the new one.
>
> Thanks,
> KP
>
>


Apache Drill vs Spark SQL

2017-04-06 Thread kant kodali
Hi All,

I am very impressed with the work done on Spark SQL however when I have to
pick something to serve real time queries I am in a dilemma for the
following reasons.

1. Even though Spark SQL has logical plans, physical plans, runtime code
generation and all that, it still doesn't look like the tool to serve real-time
queries the way we normally do from a database. I tend to think this is because
the queries have to go through job submission first. I don't want to call this
overhead or anything, but that is what it seems to involve. Comparing this, on
the other hand, we have the data that we want to serve sitting in a database
where we simply issue a SQL query and get the response back. So for this use
case, what would be an appropriate tool? I tend to think it's Drill, but I would
like to hear if there are any interesting arguments.

2. I can see a case for Spark SQL, such as queries that need to be expressed in
an iterative fashion, for example doing a graph traversal such as BFS or DFS, or
even simple pre-order, in-order, and post-order traversals on a BST. All of this
would be very hard to express in a declarative syntax like SQL. I also tend to
think ad-hoc distributed joins (by ad-hoc I mean one is not certain about their
query patterns) are better expressed in a map-reduce style than in SQL, unless
one knows their query patterns well ahead of time such that the possibility of
queries that require redistribution is very low. I am also sure there are plenty
of other cases where Spark SQL will excel, but I wanted to see what is a good
choice to simply serve the data.

Any suggestions are appreciated.

Thanks!


Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Jörn Franke
I do think this is the right way: you will have to do testing with test data,
verifying that the expected output of the calculation matches the actual output.
Even if the logical plan is correct, your calculation might not be. E.g., there
can be bugs in Spark or in the UI, or (what happens very often) the client
describes a calculation, but in the end the description is wrong.

> On 4. Apr 2017, at 05:19, kant kodali  wrote:
> 
> Hi All,
> 
> I am wondering if there is a way to persist the lineages generated by Spark
> underneath? Some of our clients want us to prove that the result of the
> computation that we are showing on a dashboard is correct, and for that, if we
> can show the lineage of transformations that were executed to get to the
> result, then that can be the Q.E.D. moment. But I am not even sure if this is
> even possible with Spark?
> 
> Thanks,
> kant




Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Gourav Sengupta
Hi,

I think that every client wants a validation process, but showing lineage is an
approach they are not asking for, and it may not be the right way to prove
correctness.


Regards,
Gourav

On Tue, Apr 4, 2017 at 4:19 AM, kant kodali  wrote:

> Hi All,
>
> I am wondering if there is a way to persist the lineages generated by Spark
> underneath? Some of our clients want us to prove that the result of the
> computation that we are showing on a dashboard is correct, and for that, if we
> can show the lineage of transformations that were executed to get to the
> result, then that can be the Q.E.D. moment. But I am not even sure if this is
> even possible with Spark?
>
> Thanks,
> kant
>


Re: Spark and Hive connection

2017-04-06 Thread Gourav Sengupta
Hi,

The connection is made to the Hive thrift server.

Regards,
Gourav

On Thu, Apr 6, 2017 at 6:06 AM, infa elance  wrote:

> Hi all,
> When using spark-shell, my understanding is that Spark connects to Hive through
> the metastore.
> The question I have is: how does Spark connect to the metastore? Is it JDBC?
>
> Thanks and Regards,
> Ajay.
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Gourav Sengupta
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande 
wrote:

> I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the
> easiest way to do this? Thanks
>


Re: distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-06 Thread Yash Sharma
Hi Ramesh,
Could you share some logs please? A pastebin? The DAG view?
Did you check for GC pauses, if any?

On Thu, 6 Apr 2017 at 21:55 Ramesh Krishnan  wrote:

> I have a use case of distinct on a dataframe. When I run the application, it
> gets stuck forever at the line *ShuffleBlockFetcherIterator: Started 4 remote
> fetches*.
>
> Can someone help?
>
>
> Thanks
> Ramesh
>


Re: Error while reading the CSV

2017-04-06 Thread Yash Sharma
Hi Nayan,
I use --packages with spark-shell and spark-submit. Could you
please try that and let us know:
Command:

spark-submit --packages com.databricks:spark-csv_2.11:1.4.0
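
As a side note, since the versions mentioned below are Spark 1.6.2 on Scala
2.10, the matching artifact would presumably be the _2.10 build, e.g. (a
sketch, adjust the version as needed):

spark-shell --packages com.databricks:spark-csv_2.10:1.5.0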


On Fri, 7 Apr 2017 at 00:39 nayan sharma  wrote:

> spark version 1.6.2
> scala version 2.10.5
>
> On 06-Apr-2017, at 8:05 PM, Jörn Franke  wrote:
>
> And which version does your Spark cluster use?
>
> On 6. Apr 2017, at 16:11, nayan sharma  wrote:
>
> scalaVersion := “2.10.5"
>
>
>
>
>
> On 06-Apr-2017, at 7:35 PM, Jörn Franke  wrote:
>
> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or
> the other way around?
>
> On 6. Apr 2017, at 15:54, nayan sharma  wrote:
>
> In addition I am using spark version 1.6.2
> Is there any chance of error coming because of Scala version or
> dependencies are not matching.?I just guessed.
>
> Thanks,
> Nayan
>
>
>
> On 06-Apr-2017, at 7:16 PM, nayan sharma  wrote:
>
> Hi Jorn,
> Thanks for replying.
>
> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>
> after doing this I have found a lot of classes under
> com/databricks/spark/csv/
>
> do I need to check for any specific class ??
>
> Regards,
> Nayan
>
> On 06-Apr-2017, at 6:42 PM, Jörn Franke  wrote:
>
> Is the library in your assembly jar?
>
> On 6. Apr 2017, at 15:06, nayan sharma  wrote:
>
> Hi All,
> I am getting error while loading CSV file.
>
> val
> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("timeline.csv")
> java.lang.NoSuchMethodError:
> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>
>
> I have added the dependencies in sbt file
>
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
>
> and starting the spark-shell with command
>
> spark-shell --master yarn-client  --jars
> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
> --name nayan
>
>
>
> Thanks for any help!!
>
>
> Thanks,
> Nayan
>
>
>
>
>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Yash Sharma
Hi Shyla,
We could suggest more specifically based on what you're trying to do exactly.
But with the given information: if you have your Spark job ready, you could
schedule it via any scheduling framework like Airflow, Celery, or cron,
depending on how simple or complex you want your workflow to be.
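
For the simplest case, a sketch of an hourly cron setup (script name, paths,
and class are hypothetical):

# /home/ec2-user/run-batch-job.sh (wrapper script around spark-submit)
/opt/spark/bin/spark-submit --master local[*] --class com.example.MyBatchJob /home/ec2-user/jobs/my-batch-job.jar

# crontab -e: run it at the top of every hour and append output to a log
0 * * * * /home/ec2-user/run-batch-job.sh >> /var/log/my-batch-job.log 2>&1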

Cheers,
Yash



On Fri, 7 Apr 2017 at 10:04 shyla deshpande 
wrote:

> I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the
> easiest way to do this? Thanks
>


What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread shyla deshpande
I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the
easiest way to do this? Thanks


Re: df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Mohamed Nadjib MAMI
That was the case. Thanks for the quick and clean answer, Hemanth.

Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي.
Mohamed Nadjib Mami
Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University
About me!
LinkedIn

On Thu, Apr 6, 2017 at 7:33 PM, Hemanth Gudela 
wrote:

> Nulls are excluded with spark.sql("SELECT count(distinct col) FROM
> Table").show()
>
> I think it is ANSI SQL behaviour.
>
> scala> spark.sql("select distinct count(null)").show(false)
>
> +-----------+
> |count(NULL)|
> +-----------+
> |0          |
> +-----------+
>
> scala> spark.sql("select distinct null").count
>
> res1: Long = 1
>
> Regards,
>
> Hemanth
>
> From: Mohamed Nadjib Mami
> Date: Thursday, 6 April 2017 at 20.29
> To: "user@spark.apache.org"
> Subject: df.count() returns one more count than SELECT COUNT()
>
> spark.sql("SELECT count(distinct col) FROM Table").show()
>


Re: df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Hemanth Gudela
Nulls are excluded with spark.sql("SELECT count(distinct col) FROM 
Table").show()
I think it is ANSI SQL behaviour.

scala> spark.sql("select distinct count(null)").show(false)
+-----------+
|count(NULL)|
+-----------+
|0          |
+-----------+

scala> spark.sql("select distinct null").count
res1: Long = 1

Regards,
Hemanth

From: Mohamed Nadjib Mami 
Date: Thursday, 6 April 2017 at 20.29
To: "user@spark.apache.org" 
Subject: df.count() returns one more count than SELECT COUNT()

spark.sql("SELECT count(distinct col) FROM Table").show()


df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Mohamed Nadjib Mami

I paste this right from Spark shell (Spark 2.1.0):


scala> spark.sql("SELECT count(distinct col) FROM Table").show()
+-------------------+
|count(DISTINCT col)|
+-------------------+
|               4697|
+-------------------+

scala> spark.sql("SELECT distinct col FROM Table").count()
res8: Long = 4698

That is, `dataframe.count()` is returning one more count than the
in-query `COUNT()` function.


Any explanations?

Cheers,
Mohamed


Is the trigger interval the same as batch interval in structured streaming?

2017-04-06 Thread kant kodali
Hi All,

Is the trigger interval mentioned in this doc
the same as the batch interval in Structured Streaming? For example, I have a
long-running receiver (not Kafka) which sends me a real-time stream. I want to
use a window interval and slide interval of 24 hours to create the tumbling
window effect, but I want to process updates every second.

Thanks!
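
For what it's worth, a sketch of the pattern being described (Spark 2.2+
Structured Streaming assumed; the built-in rate source stands in for the real
receiver, and the column names are made up). The trigger controls how often a
micro-batch fires, while the 24-hour window is part of the query itself:

import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

// stand-in stream: the rate source emits a `timestamp` column we rename
val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumnRenamed("timestamp", "eventTime")

val counts = events
  .groupBy(window(col("eventTime"), "24 hours"))   // 24-hour tumbling window
  .count()

val query = counts.writeStream
  .outputMode("update")                            // emit counts as they change
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))     // micro-batch fires every second
  .start()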


Re: Spark and Hive connection

2017-04-06 Thread Nicholas Hakobian
Spark connects directly to the Hive metastore service in order to manage
table definitions and locations and such. If you are using the CLI
interfaces and turn on INFO level logging, you can see when you instantiate
a HiveContext that it is connecting to the Hive Metastore and the URL its
using for the connection.
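
For reference, a sketch of the setting involved (host and port are made up):
when hive.metastore.uris is present in the hive-site.xml on Spark's classpath,
the connection is a thrift call to the metastore service rather than a JDBC
connection from Spark itself.

<!-- conf/hive-site.xml (example values) -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>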

Hope this helps,
Nick

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Wed, Apr 5, 2017 at 10:06 PM, infa elance  wrote:

> Hi all,
> When using spark-shell, my understanding is that Spark connects to Hive through
> the metastore.
> The question I have is: how does Spark connect to the metastore? Is it JDBC?
>
> Thanks and Regards,
> Ajay.
>


Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
spark version 1.6.2
scala version 2.10.5

> On 06-Apr-2017, at 8:05 PM, Jörn Franke  wrote:
> 
> And which version does your Spark cluster use?
> 
> On 6. Apr 2017, at 16:11, nayan sharma  > wrote:
> 
>> scalaVersion := “2.10.5"
>> 
>> 
>> 
>> 
>>> On 06-Apr-2017, at 7:35 PM, Jörn Franke >> > wrote:
>>> 
>>> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or the 
>>> other way around?
>>> 
>>> On 6. Apr 2017, at 15:54, nayan sharma >> > wrote:
>>> 
 In addition I am using spark version 1.6.2
 Is there any chance of error coming because of Scala version or 
 dependencies are not matching.?I just guessed.
 
 Thanks,
 Nayan
 
  
> On 06-Apr-2017, at 7:16 PM, nayan sharma  > wrote:
> 
> Hi Jorn,
> Thanks for replying.
> 
> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
> 
> after doing this I have found a lot of classes under 
> com/databricks/spark/csv/
> 
> do I need to check for any specific class ??
> 
> Regards,
> Nayan
>> On 06-Apr-2017, at 6:42 PM, Jörn Franke > > wrote:
>> 
>> Is the library in your assembly jar?
>> 
>> On 6. Apr 2017, at 15:06, nayan sharma > > wrote:
>> 
>>> Hi All,
>>> I am getting error while loading CSV file.
>>> 
>>> val 
>>> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
>>>  "true").load("timeline.csv")
>>> java.lang.NoSuchMethodError: 
>>> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>>> 
>>> 
>>> I have added the dependencies in sbt file 
>>> // Spark Additional Library - CSV Read as DF
>>> libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
>>> and starting the spark-shell with command
>>> 
>>> spark-shell --master yarn-client  --jars 
>>> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
>>>  --name nayan 
>>> 
>>> 
>>> 
>>> Thanks for any help!!
>>> 
>>> 
>>> Thanks,
>>> Nayan
> 
 
>> 



Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
And which version does your Spark cluster use?

> On 6. Apr 2017, at 16:11, nayan sharma  wrote:
> 
> scalaVersion := “2.10.5"
> 
> 
> 
> 
>> On 06-Apr-2017, at 7:35 PM, Jörn Franke  wrote:
>> 
>> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or the 
>> other way around?
>> 
>>> On 6. Apr 2017, at 15:54, nayan sharma  wrote:
>>> 
>>> In addition I am using spark version 1.6.2
>>> Is there any chance of error coming because of Scala version or 
>>> dependencies are not matching.?I just guessed.
>>> 
>>> Thanks,
>>> Nayan
>>> 
>>>  
 On 06-Apr-2017, at 7:16 PM, nayan sharma  wrote:
 
 Hi Jorn,
 Thanks for replying.
 
 jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
 
 after doing this I have found a lot of classes under 
 com/databricks/spark/csv/
 
 do I need to check for any specific class ??
 
 Regards,
 Nayan
> On 06-Apr-2017, at 6:42 PM, Jörn Franke  wrote:
> 
> Is the library in your assembly jar?
> 
>> On 6. Apr 2017, at 15:06, nayan sharma  wrote:
>> 
>> Hi All,
>> I am getting error while loading CSV file.
>> 
>> val 
>> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
>>  "true").load("timeline.csv")
>> java.lang.NoSuchMethodError: 
>> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>> 
>> 
>> I have added the dependencies in sbt file 
>> // Spark Additional Library - CSV Read as DF
>> libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
>> and starting the spark-shell with command
>> 
>> spark-shell --master yarn-client  --jars 
>> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
>>  --name nayan 
>> 
>> 
>> 
>> Thanks for any help!!
>> 
>> 
>> Thanks,
>> Nayan
 
>>> 
> 


Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
scalaVersion := "2.10.5"




> On 06-Apr-2017, at 7:35 PM, Jörn Franke  wrote:
> 
> Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or the 
> other way around?
> 
> On 6. Apr 2017, at 15:54, nayan sharma  > wrote:
> 
>> In addition I am using spark version 1.6.2
>> Is there any chance of error coming because of Scala version or dependencies 
>> are not matching.?I just guessed.
>> 
>> Thanks,
>> Nayan
>> 
>>  
>>> On 06-Apr-2017, at 7:16 PM, nayan sharma >> > wrote:
>>> 
>>> Hi Jorn,
>>> Thanks for replying.
>>> 
>>> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>>> 
>>> after doing this I have found a lot of classes under 
>>> com/databricks/spark/csv/
>>> 
>>> do I need to check for any specific class ??
>>> 
>>> Regards,
>>> Nayan
 On 06-Apr-2017, at 6:42 PM, Jörn Franke > wrote:
 
 Is the library in your assembly jar?
 
 On 6. Apr 2017, at 15:06, nayan sharma > wrote:
 
> Hi All,
> I am getting error while loading CSV file.
> 
> val 
> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
>  "true").load("timeline.csv")
> java.lang.NoSuchMethodError: 
> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
> 
> 
> I have added the dependencies in sbt file 
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
> and starting the spark-shell with command
> 
> spark-shell --master yarn-client  --jars 
> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
>  --name nayan 
> 
> 
> 
> Thanks for any help!!
> 
> 
> Thanks,
> Nayan
>>> 
>> 



Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
Maybe your Spark is based on scala 2.11, but you compile it for 2.10 or the 
other way around?

> On 6. Apr 2017, at 15:54, nayan sharma  wrote:
> 
> In addition I am using spark version 1.6.2
> Is there any chance of error coming because of Scala version or dependencies 
> are not matching.?I just guessed.
> 
> Thanks,
> Nayan
> 
>  
>> On 06-Apr-2017, at 7:16 PM, nayan sharma  wrote:
>> 
>> Hi Jorn,
>> Thanks for replying.
>> 
>> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
>> 
>> after doing this I have found a lot of classes under 
>> com/databricks/spark/csv/
>> 
>> do I need to check for any specific class ??
>> 
>> Regards,
>> Nayan
>>> On 06-Apr-2017, at 6:42 PM, Jörn Franke  wrote:
>>> 
>>> Is the library in your assembly jar?
>>> 
 On 6. Apr 2017, at 15:06, nayan sharma  wrote:
 
 Hi All,
 I am getting error while loading CSV file.
 
 val 
 datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header",
  "true").load("timeline.csv")
 java.lang.NoSuchMethodError: 
 org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
 
 
 I have added the dependencies in sbt file 
 // Spark Additional Library - CSV Read as DF
 libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
 and starting the spark-shell with command
 
 spark-shell --master yarn-client  --jars 
 /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
  --name nayan 
 
 
 
 Thanks for any help!!
 
 
 Thanks,
 Nayan
>> 
> 


Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
In addition, I am using Spark version 1.6.2.
Is there any chance the error is coming because the Scala version or dependencies
are not matching? I am just guessing.

Thanks,
Nayan

 
> On 06-Apr-2017, at 7:16 PM, nayan sharma  wrote:
> 
> Hi Jorn,
> Thanks for replying.
> 
> jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv
> 
> after doing this I have found a lot of classes under 
> com/databricks/spark/csv/
> 
> do I need to check for any specific class ??
> 
> Regards,
> Nayan
>> On 06-Apr-2017, at 6:42 PM, Jörn Franke > > wrote:
>> 
>> Is the library in your assembly jar?
>> 
>> On 6. Apr 2017, at 15:06, nayan sharma > > wrote:
>> 
>>> Hi All,
>>> I am getting error while loading CSV file.
>>> 
>>> val 
>>> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header", 
>>> "true").load("timeline.csv")
>>> java.lang.NoSuchMethodError: 
>>> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>>> 
>>> 
>>> I have added the dependencies in sbt file 
>>> // Spark Additional Library - CSV Read as DF
>>> libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
>>> and starting the spark-shell with command
>>> 
>>> spark-shell --master yarn-client  --jars 
>>> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
>>>  --name nayan 
>>> 
>>> 
>>> 
>>> Thanks for any help!!
>>> 
>>> 
>>> Thanks,
>>> Nayan
> 



Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
Hi Jorn,
Thanks for replying.

jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv

after doing this I have found a lot of classes under
com/databricks/spark/csv/

Do I need to check for any specific class?

Regards,
Nayan
> On 06-Apr-2017, at 6:42 PM, Jörn Franke  wrote:
> 
> Is the library in your assembly jar?
> 
> On 6. Apr 2017, at 15:06, nayan sharma  > wrote:
> 
>> Hi All,
>> I am getting error while loading CSV file.
>> 
>> val 
>> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header", 
>> "true").load("timeline.csv")
>> java.lang.NoSuchMethodError: 
>> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
>> 
>> 
>> I have added the dependencies in sbt file 
>> // Spark Additional Library - CSV Read as DF
>> libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
>> and starting the spark-shell with command
>> 
>> spark-shell --master yarn-client  --jars 
>> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
>>  --name nayan 
>> 
>> 
>> 
>> Thanks for any help!!
>> 
>> 
>> Thanks,
>> Nayan



Re: Reading ASN.1 files in Spark

2017-04-06 Thread Yong Zhang
Spark can read any file, as long as you can provide it the Hadoop InputFormat 
implementation.


Did you try this guy's example?


http://awcoleman.blogspot.com/2014/07/processing-asn1-call-detail-records.html




Yong




From: vincent gromakowski 
Sent: Thursday, April 6, 2017 5:24 AM
To: Hamza HACHANI
Cc: user@spark.apache.org
Subject: Re: Reading ASN.1 files in Spark

I would also be interested...

2017-04-06 11:09 GMT+02:00 Hamza HACHANI 
>:
Does any body have a spark code example where he is reading ASN.1 files ?
Thx

Best regards
Hamza



Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
Is the library in your assembly jar?

> On 6. Apr 2017, at 15:06, nayan sharma  wrote:
> 
> Hi All,
> I am getting error while loading CSV file.
> 
> val 
> datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").load("timeline.csv")
> java.lang.NoSuchMethodError: 
> org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;
> 
> 
> I have added the dependencies in sbt file 
> // Spark Additional Library - CSV Read as DF
> libraryDependencies += "com.databricks" %% "spark-csv" % “1.5.0"
> and starting the spark-shell with command
> 
> spark-shell --master yarn-client  --jars 
> /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
>  --name nayan 
> 
> 
> 
> Thanks for any help!!
> 
> 
> Thanks,
> Nayan


Error while reading the CSV

2017-04-06 Thread nayan sharma
Hi All,
I am getting error while loading CSV file.

val datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header", 
"true").load("timeline.csv")
java.lang.NoSuchMethodError: 
org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;


I have added the dependencies in the sbt file:
// Spark Additional Library - CSV Read as DF
libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
and I am starting the spark-shell with the command:

spark-shell --master yarn-client  --jars 
/opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar
 --name nayan 



Thanks for any help!!


Thanks,
Nayan

distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-06 Thread Ramesh Krishnan
I have a use case of distinct on a dataframe. When I run the application, it
gets stuck forever at the line *ShuffleBlockFetcherIterator: Started 4 remote
fetches*.

Can someone help?


Thanks
Ramesh


Re: How does partitioning happen for binary files in spark ?

2017-04-06 Thread Jay
The code that you see in github is for version 2.1. For versions below that
the default partitions for binary files is set to 2 which you can change by
using the minPartitions value. I am not sure starting 2.1 how the
minPartitions column will work because as you said the field is completely
ignored.

Thanks,
Jayadeep

On Thu, Apr 6, 2017 at 3:43 PM, ashwini anand  wrote:

> By looking into the source code, I found that for textFile(), the
> partitioning is computed by the computeSplitSize() function in
> FileInputFormat class. This function takes into consideration the
> minPartitions value passed by user. As per my understanding , the same
> thing for binaryFiles() is computed by the setMinPartitions() function of
> PortableDataStream class. This setMinPartitions() function completely
> ignores the minPartitions value passed by user. However I find that in my
> application somehow the partition varies based on the minPartition value in
> case of binaryFiles() too. I have no idea how this is happening. Please
> help me understand how the partitioning happens in case of binaryFiles().
>
> source code for setMinPartitions() is as below:
>
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
>   val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
>   val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
>   val defaultParallelism = sc.defaultParallelism
>   val files = listStatus(context).asScala
>   val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
>   val bytesPerCore = totalBytes / defaultParallelism
>   val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
>   super.setMaxSplitSize(maxSplitSize)
> }
> --
> View this message in context: How does partitioning happen for binary
> files in spark ?
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
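
For reference, a small sketch of the call itself (hypothetical HDFS path);
minPartitions is only a hint here, and with the 2.x code quoted above the
actual split size is driven by spark.files.maxPartitionBytes and
spark.files.openCostInBytes:

val streams = sc.binaryFiles("hdfs:///data/binary/", minPartitions = 8)  // RDD[(String, PortableDataStream)]
println(streams.getNumPartitions)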


How does partitioning happen for binary files in spark ?

2017-04-06 Thread ashwini anand
By looking into the source code, I found that for textFile(), the
partitioning is computed by the computeSplitSize() function in
FileInputFormat class. This function takes into consideration the
minPartitions value passed by user. As per my understanding , the same thing
for binaryFiles() is computed by the setMinPartitions() function of
PortableDataStream class. This setMinPartitions() function completely
ignores the minPartitions value passed by user. However I find that in my
application somehow the partition varies based on the minPartition value in
case of binaryFiles() too. I have no idea how this is happening. Please help
me understand how the partitioning happens in case of binaryFiles().

source code for setMinPartitions() is as below:

def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  val defaultParallelism = sc.defaultParallelism
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  super.setMaxSplitSize(maxSplitSize)
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-partitioning-happen-for-binary-files-in-spark-tp28575.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Reading ASN.1 files in Spark

2017-04-06 Thread vincent gromakowski
I would also be interested...

2017-04-06 11:09 GMT+02:00 Hamza HACHANI :

> Does any body have a spark code example where he is reading ASN.1 files ?
> Thx
>
> Best regards
> Hamza
>


Reading ASN.1 files in Spark

2017-04-06 Thread Hamza HACHANI
Does any body have a spark code example where he is reading ASN.1 files ?
Thx

Best regards
Hamza


Re: scala test is unable to initialize spark context.

2017-04-06 Thread Jeff Zhang
Seems it is caused by your log4j file

*Caused by: java.lang.IllegalStateException: FileNamePattern [-.log]
does not contain a valid date format specifier*
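
If that pattern comes from a rolling-file appender, the usual fix is to give it
a %d date token; a sketch (the appender name and property layout below assume
the log4j-extras TimeBasedRollingPolicy, so adjust to the actual appender in
use):

# log4j.properties (sketch)
log4j.appender.file=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.file.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
# the pattern must contain a date format specifier such as %d{...}
log4j.appender.file.rollingPolicy.FileNamePattern=logs/app-%d{yyyy-MM-dd}.log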




On Thu, Apr 6, 2017 at 4:03 PM, wrote:

> Hi All ,
>
>
>
> I am just trying to use ScalaTest for testing a small piece of Spark code, but
> the Spark context is not getting initialized while I am running the test file.
>
> I have included the code, POM, and exception in this mail. Please help me
> understand what mistake I am making that prevents the Spark context from
> getting initialized.
>
>
>
> Code:-
>
> import org.apache.log4j.LogManager
> import org.apache.spark.SharedSparkContext
> import org.scalatest.FunSuite
> import org.apache.spark.{SparkContext, SparkConf}
>
> /**
>  * Created by PSwain on 4/5/2017.
>  */
> class Test extends FunSuite with SharedSparkContext {
>
>   test("test initializing spark context") {
>     val list = List(1, 2, 3, 4)
>     val rdd = sc.parallelize(list)
>     assert(list.length === rdd.count())
>   }
> }
>
>
>
> POM File:-
>
> <?xml version="1.0" encoding="UTF-8"?>
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>                              http://maven.apache.org/xsd/maven-4.0.0.xsd">
>     <modelVersion>4.0.0</modelVersion>
>
>     <groupId>tesing.loging</groupId>
>     <artifactId>logging</artifactId>
>     <version>1.0-SNAPSHOT</version>
>
>     <repositories>
>         <repository>
>             <id>central</id>
>             <name>central</name>
>             <url>http://repo1.maven.org/maven/</url>
>         </repository>
>     </repositories>
>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-core_2.10</artifactId>
>             <version>1.6.0</version>
>             <type>test-jar</type>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-sql_2.10</artifactId>
>             <version>1.6.0</version>
>         </dependency>
>         <dependency>
>             <groupId>org.scalatest</groupId>
>             <artifactId>scalatest_2.10</artifactId>
>             <version>2.2.6</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-hive_2.10</artifactId>
>             <version>1.5.0</version>
>             <scope>provided</scope>
>         </dependency>
>         <dependency>
>             <groupId>com.databricks</groupId>
>             <artifactId>spark-csv_2.10</artifactId>
>             <version>1.3.0</version>
>         </dependency>
>         <dependency>
>             <groupId>com.rxcorp.bdf.logging</groupId>
>             <artifactId>loggingframework</artifactId>
>             <version>1.0-SNAPSHOT</version>
>         </dependency>
>         <dependency>
>             <groupId>mysql</groupId>
>             <artifactId>mysql-connector-java</artifactId>
>             <version>5.1.6</version>
>             <scope>provided</scope>
>         </dependency>
>         <dependency>
>             <groupId>org.scala-lang</groupId>
>             <artifactId>scala-library</artifactId>
>             <version>2.10.5</version>
>             <scope>compile</scope>
>             <optional>true</optional>
>         </dependency>
>         <dependency>
>             <groupId>org.scalatest</groupId>
>             <artifactId>scalatest</artifactId>
>             <version>1.4.RC2</version>
>         </dependency>
>         <dependency>
>             <groupId>log4j</groupId>
>             <artifactId>log4j</artifactId>
>             <version>1.2.17</version>
>         </dependency>
>         <dependency>
>             <groupId>org.scala-lang</groupId>
>             <artifactId>scala-compiler</artifactId>
>             <version>2.10.5</version>
>             <scope>compile</scope>
>             <optional>true</optional>
>         </dependency>
>     </dependencies>
>
>     <build>
>         <sourceDirectory>src/main/scala</sourceDirectory>
>         <plugins>
>             <plugin>
>                 <artifactId>maven-assembly-plugin</artifactId>
>                 <version>2.2.1</version>
>                 <configuration>
>                     <descriptorRefs>
>                         <descriptorRef>jar-with-dependencies</descriptorRef>
>                     </descriptorRefs>
>                 </configuration>
>                 <executions>
>                     <execution>
>                         <id>make-assembly</id>
>                         <phase>package</phase>
>                         <goals>
>                             <goal>single</goal>
>                         </goals>
>                     </execution>
>                 </executions>
>             </plugin>
>             <plugin>
>                 <groupId>net.alchim31.maven</groupId>
>                 <artifactId>scala-maven-plugin</artifactId>
>                 <version>3.2.0</version>
>                 <executions>
>                     <execution>
>                         <goals>
>                             <goal>compile</goal>
>                             <goal>testCompile</goal>
>                         </goals>
>                     </execution>
>                 </executions>
>                 <configuration>
>                     <sourceDir>src/main/scala</sourceDir>
>
>                     <jvmArgs>
>

scala test is unable to initialize spark context.

2017-04-06 Thread PSwain
Hi All ,

   I am just trying to use ScalaTest for testing a small piece of Spark code, but
the Spark context is not getting initialized while I am running the test file.
I have included the code, POM, and exception in this mail. Please help me
understand what mistake I am making that prevents the Spark context from
getting initialized.

Code:-

import org.apache.log4j.LogManager
import org.apache.spark.SharedSparkContext
import org.scalatest.FunSuite
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by PSwain on 4/5/2017.
  */
class Test extends FunSuite with SharedSparkContext  {


  test("test initializing spark context") {
val list = List(1, 2, 3, 4)
val rdd = sc.parallelize(list)
assert(list.length === rdd.count())
  }
}

POM File:-



<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>tesing.loging</groupId>
    <artifactId>logging</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>central</id>
            <name>central</name>
            <url>http://repo1.maven.org/maven/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.0</version>
            <type>test-jar</type>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.10</artifactId>
            <version>2.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.5.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.10</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>com.rxcorp.bdf.logging</groupId>
            <artifactId>loggingframework</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.6</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.5</version>
            <scope>compile</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest</artifactId>
            <version>1.4.RC2</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>2.10.5</version>
            <scope>compile</scope>
            <optional>true</optional>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <sourceDir>src/main/scala</sourceDir>
                    <jvmArgs>
                        <jvmArg>-Xms64m</jvmArg>
                        <jvmArg>-Xmx1024m</jvmArg>
                    </jvmArgs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Exception:-



An exception or error caused a run to abort.

java.lang.ExceptionInInitializerError

 at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)

 at 
org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:106)

 at org.apache.spark.Logging$class.log(Logging.scala:50)

 at org.apache.spark.SparkContext.log(SparkContext.scala:79)

 at org.apache.spark.Logging$class.logInfo(Logging.scala:58)

 at org.apache.spark.SparkContext.logInfo(SparkContext.scala:79)

 at org.apache.spark.SparkContext.<init>(SparkContext.scala:211)

 at org.apache.spark.SparkContext.<init>(SparkContext.scala:147)

 at 
org.apache.spark.SharedSparkContext$class.beforeAll(SharedSparkContext.scala:33)

 at Test.beforeAll(Test.scala:10)

 at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)

 at Test.beforeAll(Test.scala:10)

 at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)

 at Test.run(Test.scala:10)

 at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)

 at 
org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)

 at 
org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)

 at 
org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)

 at 

use UTF-16 decode in pyspark streaming

2017-04-06 Thread Yogesh Vyas
Hi,

I am trying to decode the binary data using UTF-16 decoding in a Kafka consumer
using Spark Streaming, but it is giving the error:
TypeError: 'str' object is not callable

I am doing it in the following way:

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
{topic: 1},valueDecoder="utfTo16")

def utfTo16(msg):
  return msg.decode("utf-16")


Please suggest if I am doing it right or not??


Regards,
Yogesh