Re: Returning DataFrame for text file

2017-04-06 Thread Yan Facai
SparkSession.read returns a DataFrameReader. DataFrameReader supports a series of formats, such as csv, json, and text, as you mentioned. Check the API to find more details: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
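
As a quick illustration of those three readers, a minimal sketch (the file paths are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("readers").getOrCreate()
    val csvDF  = spark.read.option("header", "true").csv("/data/input.csv")
    val jsonDF = spark.read.json("/data/input.json")
    val textDF = spark.read.text("/data/input.txt") // one row per line, single "value" column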

Re: Master-Worker communication on Standalone cluster issues

2017-04-06 Thread Yan Facai
1. For worker and master: spark.worker.timeout (default 60s), see: http://spark.apache.org/docs/latest/spark-standalone.html
2. For executor and driver: spark.executor.heartbeatInterval (default 10s), see: http://spark.apache.org/docs/latest/configuration.html
Please correct me if I'm wrong. On Thu, Apr 6, 2017
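
If one wanted to override these, a hedged sketch of conf/spark-defaults.conf entries (the values shown are the documented defaults):

    # standalone master/worker liveness timeout
    spark.worker.timeout              60
    # executor-to-driver heartbeat interval
    spark.executor.heartbeatInterval  10s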

Apache Drill vs Spark SQL

2017-04-06 Thread kant kodali
Hi All, I am very impressed with the work done on Spark SQL; however, when I have to pick something to serve real-time queries I am in a dilemma for the following reasons. 1. Even though Spark SQL has logical plans, physical plans, runtime code generation and all that, it still doesn't look

Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Jörn Franke
I do think this is the right way: you will have to do testing with test data, verifying that the expected output of the calculation is the actual output. Even if the logical plan is correct, your calculation might not be. E.g. there can be bugs in Spark, in the UI, or (what is very often the case) in the client

Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Gourav Sengupta
Hi, I think that every client wants a validation process, but showing lineage is an approach that they are not asking for, and it may not be the right way to prove it. Regards, Gourav On Tue, Apr 4, 2017 at 4:19 AM, kant kodali wrote: > Hi All, > > I am wondering if there is a way

Re: Spark and Hive connection

2017-04-06 Thread Gourav Sengupta
Hi, the connection is made to the Hive Thrift server. Regards, Gourav On Thu, Apr 6, 2017 at 6:06 AM, infa elance wrote: > Hi all, > When using spark-shell my understanding is Spark connects to Hive through the > metastore. > The question I have is does Spark connect to

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Gourav Sengupta
Hi Shyla, why would you want to schedule a Spark job on EC2 instead of EMR? Regards, Gourav On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande wrote: > I want to run a Spark batch job maybe hourly on AWS EC2. What is the > easiest way to do this. Thanks >

Re: distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-06 Thread Yash Sharma
Hi Ramesh, Could you share some logs please? A pastebin? The DAG view? Did you check for GC pauses, if any? On Thu, 6 Apr 2017 at 21:55 Ramesh Krishnan wrote: > I have a use case of distinct on a dataframe. When I run the application > it is getting stuck at LINE

Re: Error while reading the CSV

2017-04-06 Thread Yash Sharma
Hi Nayan, I use --packages with spark-shell and spark-submit. Could you please try that and let us know. Command: spark-submit --packages com.databricks:spark-csv_2.11:1.4.0 On Fri, 7 Apr 2017 at 00:39 nayan sharma wrote: > spark version 1.6.2 > scala
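
Note that the _2.11 suffix must match the cluster's Scala version; for the Scala 2.10.5 cluster mentioned later in this thread, the coordinate would be spark-csv_2.10. A hedged sketch of a full invocation (the application jar and class name are illustrative):

    spark-submit \
      --packages com.databricks:spark-csv_2.10:1.4.0 \
      --class com.example.Main \
      my-app-assembly-1.0.jar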

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Yash Sharma
Hi Shyla, We could suggest based on what exactly you're trying to do. But with the given information: if you have your Spark job ready, you could schedule it via any scheduling framework like Airflow, Celery or cron, based on how simple/complex you want your workflow to be. Cheers, Yash On
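
For the cron route specifically, a hedged sketch of a crontab entry (all paths, the class name and the master URL are illustrative) that fires at the top of every hour:

    0 * * * * /opt/spark/bin/spark-submit --master spark://master:7077 --class com.example.HourlyBatch /opt/jobs/hourly-batch.jar >> /var/log/hourly-batch.log 2>&1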

What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread shyla deshpande
I want to run a Spark batch job, maybe hourly, on AWS EC2. What is the easiest way to do this? Thanks

Re: df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Mohamed Nadjib MAMI
That was the case. Thanks for the quick and clean answer, Hemanth. Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي. Mohamed Nadjib Mami, Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University. About me!

Re: df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Hemanth Gudela
Nulls are excluded with spark.sql("SELECT count(distinct col) FROM Table").show(). I think it is ANSI SQL behaviour.

    scala> spark.sql("select distinct count(null)").show(false)
    +-----------+
    |count(NULL)|
    +-----------+
    |0          |
    +-----------+

    scala> spark.sql("select distinct null").count
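
The one-extra-row difference in this thread follows from that: a null in the column is counted by df.count() (which counts rows) but not by COUNT(col). A minimal sketch, runnable in spark-shell (the column name and values are illustrative):

    import spark.implicits._
    val df = Seq(Some(1), Some(2), None).toDF("col")
    df.createOrReplaceTempView("t")
    spark.sql("SELECT count(DISTINCT col) FROM t").show() // 2: the null is excluded
    println(df.count())                                   // 3: every row counts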

df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Mohamed Nadjib Mami
I paste this right from the Spark shell (Spark 2.1.0):

    scala> spark.sql("SELECT count(distinct col) FROM Table").show()
    +-------------------+
    |count(DISTINCT col)|
    +-------------------+
    |               4697|
    +-------------------+

    scala>

Is the trigger interval the same as batch interval in structured streaming?

2017-04-06 Thread kant kodali
Hi All, Is the trigger interval mentioned in this doc the same as the batch interval in structured streaming? For example, I have a long-running receiver (not Kafka) which sends me a real-time stream, and I want to use window
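
For reference, a minimal sketch of setting a processing-time trigger in Spark 2.1 structured streaming (the socket source, host and port are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.ProcessingTime

    val spark = SparkSession.builder.master("local[2]").appName("trigger-demo").getOrCreate()
    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", "9999").load()
    val query = lines.writeStream
      .format("console")
      .trigger(ProcessingTime("10 seconds")) // a micro-batch fires at most every 10s
      .start()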

Re: Spark and Hive connection

2017-04-06 Thread Nicholas Hakobian
Spark connects directly to the Hive metastore service in order to manage table definitions and locations and such. If you are using the CLI interfaces and turn on INFO level logging, you can see when you instantiate a HiveContext that it is connecting to the Hive Metastore and the URL it's using
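
A minimal sketch of what that instantiation looks like in Spark 1.x (sc is the shell's SparkContext; with INFO logging on, the metastore connection and its URL are logged at this point):

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()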

Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
Spark version 1.6.2, Scala version 2.10.5 > On 06-Apr-2017, at 8:05 PM, Jörn Franke wrote: > > And which version does your Spark cluster use? > > On 6. Apr 2017, at 16:11, nayan sharma wrote: > >> scalaVersion

Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
And which version does your Spark cluster use? > On 6. Apr 2017, at 16:11, nayan sharma wrote: > > scalaVersion := "2.10.5" > > > > >> On 06-Apr-2017, at 7:35 PM, Jörn Franke wrote: >> >> Maybe your Spark is based on Scala 2.11, but you compile it for 2.10 or the >> other way around?

Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
scalaVersion := "2.10.5" > On 06-Apr-2017, at 7:35 PM, Jörn Franke wrote: > > Maybe your Spark is based on Scala 2.11, but you compile it for 2.10 or the > other way around? > > On 6. Apr 2017, at 15:54, nayan sharma

Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
Maybe your Spark is based on Scala 2.11, but you compile it for 2.10 or the other way around? > On 6. Apr 2017, at 15:54, nayan sharma wrote: > > In addition I am using Spark version 1.6.2 > Is there any chance of error coming because of Scala version or dependencies
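
A hedged build.sbt sketch of keeping the two aligned (versions taken from this thread; the %% operator appends the matching _2.10 suffix automatically):

    scalaVersion := "2.10.5"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
      "com.databricks"   %% "spark-csv"  % "1.4.0"
    )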

Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
In addition, I am using Spark version 1.6.2. Is there any chance the error is coming because the Scala version or dependencies are not matching? I just guessed. Thanks, Nayan > On 06-Apr-2017, at 7:16 PM, nayan sharma wrote: > > Hi Jorn, > Thanks for replying. > > jar -tf

Re: Error while reading the CSV

2017-04-06 Thread nayan sharma
Hi Jorn, Thanks for replying. After running jar -tf catalyst-data-prepration-assembly-1.0.jar | grep csv, I have found a lot of classes under com/databricks/spark/csv/. Do I need to check for any specific class? Regards, Nayan > On 06-Apr-2017, at 6:42 PM, Jörn Franke

Re: Reading ASN.1 files in Spark

2017-04-06 Thread Yong Zhang
Spark can read any file, as long as you can provide it a Hadoop InputFormat implementation. Did you try this guy's example? http://awcoleman.blogspot.com/2014/07/processing-asn1-call-detail-records.html
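
A minimal sketch of wiring such an InputFormat into Spark (sc is the shell's SparkContext; Asn1InputFormat is a hypothetical class extending org.apache.hadoop.mapreduce.InputFormat, along the lines of the linked example; the path is illustrative):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    val records = sc.newAPIHadoopFile(
      "hdfs:///data/cdrs",
      classOf[Asn1InputFormat],   // hypothetical custom InputFormat
      classOf[LongWritable],
      classOf[BytesWritable])
    println(records.count())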

Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
Is the library in your assembly jar? > On 6. Apr 2017, at 15:06, nayan sharma wrote: > > Hi All, > I am getting an error while loading a CSV file. > > val datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").load("timeline.csv") >

Error while reading the CSV

2017-04-06 Thread nayan sharma
Hi All, I am getting an error while loading a CSV file:

    val datacsv = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("timeline.csv")
    java.lang.NoSuchMethodError: org.apache.commons.csv.CSVFormat.withQuote(Ljava/lang/Character;)Lorg/apache/commons/csv/CSVFormat;

I

distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-06 Thread Ramesh Krishnan
I have a use case of distinct on a dataframe. When I run the application, it is getting stuck forever at the line "ShuffleBlockFetcherIterator: Started 4 remote fetches". Can someone help? Thanks, Ramesh

Re: How does partitioning happen for binary files in spark ?

2017-04-06 Thread Jay
The code that you see on GitHub is for version 2.1. For versions below that, the default number of partitions for binary files is set to 2, which you can change by using the minPartitions value. I am not sure how minPartitions will work starting with 2.1 because, as you said, the field is completely
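
A minimal sketch of passing minPartitions when reading binary files (sc is the shell's SparkContext; the path is illustrative; the hint applies to the pre-2.1 behaviour discussed above):

    val blobs = sc.binaryFiles("hdfs:///data/blobs", minPartitions = 8)
    println(blobs.getNumPartitions)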

How does partitioning happen for binary files in spark ?

2017-04-06 Thread ashwini anand
By looking into the source code, I found that for textFile(), the partitioning is computed by the computeSplitSize() function in the FileInputFormat class. This function takes into consideration the minPartitions value passed by the user. As per my understanding, the same thing for binaryFiles() is

Re: Reading ASN.1 files in Spark

2017-04-06 Thread vincent gromakowski
I would also be interested... 2017-04-06 11:09 GMT+02:00 Hamza HACHANI: > Does anybody have a Spark code example for reading ASN.1 files? > Thx > > Best regards > Hamza >

Reading ASN.1 files in Spark

2017-04-06 Thread Hamza HACHANI
Does anybody have a Spark code example for reading ASN.1 files? Thx. Best regards, Hamza

Re: scala test is unable to initialize spark context.

2017-04-06 Thread Jeff Zhang
Seems it is caused by your log4j file: "Caused by: java.lang.IllegalStateException: FileNamePattern [-.log] does not contain a valid date format specifier". On Thu, Apr 6, 2017 at 4:03 PM, wrote: > Hi All, > > I am just trying to use ScalaTest for testing a small piece of Spark code.
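
For reference, a hedged log4j.properties sketch of a FileNamePattern that does carry a date specifier (the appender name and path are illustrative; TimeBasedRollingPolicy comes from the log4j-extras companion):

    log4j.appender.file=org.apache.log4j.rolling.RollingFileAppender
    log4j.appender.file.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
    log4j.appender.file.rollingPolicy.FileNamePattern=logs/app-%d{yyyy-MM-dd}.log
    log4j.appender.file.layout=org.apache.log4j.PatternLayout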

scala test is unable to initialize spark context.

2017-04-06 Thread PSwain
Hi All, I am just trying to use ScalaTest for testing a small piece of Spark code. But the Spark context is not getting initialized while I am running the test file. I have given the code, POM and exception in this mail; please help me to understand what mistake I am making, so that the Spark context
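
Since the original code is not reproduced here, a minimal sketch of one common way to initialize a SparkContext in a ScalaTest suite (the class and app names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class SparkSmokeTest extends FunSuite with BeforeAndAfterAll {
      private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
      }

      override def afterAll(): Unit = if (sc != null) sc.stop()

      test("count works") {
        assert(sc.parallelize(1 to 10).count() === 10)
      }
    }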

use UTF-16 decode in pyspark streaming

2017-04-06 Thread Yogesh Vyas
Hi, I am trying to decode binary data using UTF-16 decode in a Kafka consumer using Spark Streaming. But it is giving the error: TypeError: 'str' object is not callable. I am doing it in the following way: kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: