Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
Would it not be like appending lines to the same file in that case? On Tue, Jan 16, 2018 at 4:50 AM, kant kodali wrote: > Got it! What about overwriting the same file instead of appending? > > On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta < > gourav.sengu...@gmail.com>

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread ayan guha
http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html#input-sources On Tue, Jan 16, 2018 at 3:50 PM, kant kodali wrote: > Got it! What about overwriting the same file instead of appending? > > On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta < >

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Got it! What about overwriting the same file instead of appending? On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta wrote: > What Gerard means is that if you are adding new files in to the same base > path (key) then its fine, but in case you are appending lines to

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
What Gerard means is that if you are adding new files into the same base path (key) then it's fine, but in case you are appending lines to the same file then changes will not be picked up. Regards, Gourav Sengupta On Tue, Jan 16, 2018 at 12:20 AM, kant kodali wrote: > Hi, >
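
To make that concrete, a minimal sketch (assuming Spark 2.2 Structured Streaming; the HDFS paths are hypothetical) of monitoring a directory as a source, where only whole files atomically moved in are picked up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("file-source-sketch").getOrCreate()

    // Each new file atomically moved into the directory becomes new input rows;
    // lines appended to an already-seen file are NOT detected.
    val lines = spark.readStream.textFile("hdfs:///data/incoming")  // hypothetical path

    val query = lines.writeStream.format("console").start()
    query.awaitTermination()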

Re: spark-submit can find python?

2018-01-15 Thread Jeff Zhang
Hi Manuel, Looks like you are using the virtualenv support of Spark. Virtualenv will create a Python environment on each executor. >>> --conf >>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \ And you are not setting the configuration properly: spark.pyspark.virtualenv.bin.path
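
A common alternative, sketched here under the assumption that the virtualenv already exists at the same path on every node, is to point Spark at the environment's interpreter directly (the interpreter path below is derived from the activate path in the question; the script name is hypothetical):

    spark-submit --master yarn \
      --deploy-mode cluster \
      --conf spark.pyspark.python=/home/mansop/hail-test/python-2.7.2/bin/python \
      my_script.py

spark.pyspark.python is a standard Spark 2.1+ setting that selects the Python binary used for PySpark on the driver and executors.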

Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting. Can you include the steps to reproduce this bug? On Tue, Jan 16, 2018 at 7:07 AM, Ted Yu wrote: > Did you include any picture ? > > Looks like the picture didn't go thru. > > Please use third party site. > > Thanks > > Original message

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi, I am not sure I understand. Any examples? On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas wrote: > Hi, > > You can monitor a filesystem directory as streaming source as long as the > files placed there are atomically copied/moved into the directory. > Updating the

RE: spark-submit can find python?

2018-01-15 Thread Manuel Sopena Ballesteros
Apologies, I copied the wrong spark-submit output from running in a cluster. Please find below the right output for the question asked:

-bash-4.1$ spark-submit --master yarn \
> --deploy-mode cluster \
> --driver-memory 4g \
> --executor-memory 2g \
> --executor-cores 4 \
>

spark-submit can find python?

2018-01-15 Thread Manuel Sopena Ballesteros
Hi all, I am quite new to Spark and need some help troubleshooting the execution of an application running on a Spark cluster... My Spark environment is deployed using Ambari (HDP), YARN is the resource scheduler, and Hadoop (HDFS) is the file system. The application I am trying to run is a python script

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gerard Maas
Hi, You can monitor a filesystem directory as a streaming source as long as the files placed there are atomically copied/moved into the directory. Updating the files is not supported. kr, Gerard. On Mon, Jan 15, 2018 at 11:41 PM, kant kodali wrote: > Hi All, > > I am

Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture? Looks like the picture didn't go through. Please use a third party site. Thanks. Original message From: Tomasz Gawęda Date: 1/15/18 2:07 PM (GMT-08:00) To: d...@spark.apache.org, user@spark.apache.org Subject: Broken SQL

can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi All, I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0? For example, can I have stream1 reading from Kafka and writing to HDFS, and stream2 reading from HDFS and writing back to Kafka, such that stream2 will be pulling the latest updates written by stream1? Thanks!
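
For reference against the answers further down the thread: a rough sketch of the two-stage pipeline being asked about (broker address, topic names, paths, and schema are all hypothetical), which works because Spark's file sink writes complete new files per trigger and the file source of a second query then picks each one up:

    import org.apache.spark.sql.types._

    // stream1: Kafka -> HDFS (file sink writes a new file per trigger)
    val stream1 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "in-topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///tmp/stage")
      .option("checkpointLocation", "hdfs:///tmp/ckpt1")
      .start()

    // stream2: HDFS -> Kafka; each file completed by stream1 becomes new input
    val schema = new StructType().add("value", StringType)
    val stream2 = spark.readStream
      .schema(schema)  // file sources require an explicit schema
      .parquet("hdfs:///tmp/stage")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "out-topic")
      .option("checkpointLocation", "hdfs:///tmp/ckpt2")
      .start()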

Re: Timestamp changing while writing

2018-01-15 Thread Bryan Cutler
Spark internally stores timestamps as UTC values, so createDataFrame will convert from the local time zone to UTC. I think there was a Jira to correct parquet output. Are the values you are seeing offset from your local time zone? On Jan 11, 2018 4:49 PM, "sk skk" wrote: >
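
A small sketch of the behaviour Bryan describes (assuming Spark 2.2+, where the session time zone setting is available):

    // Spark stores timestamps internally as UTC; how they are displayed
    // (and parsed from strings) depends on the session time zone.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    import java.sql.Timestamp
    // Timestamp.valueOf interprets the string in the JVM's default time zone
    val df = spark.createDataFrame(Seq(Tuple1(Timestamp.valueOf("2018-01-15 12:00:00"))))
    df.show()  // rendered in the session time zone set above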

Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael,

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> val r1 = spark.range(1)
r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> r1.as("left").join(r1.as("right")).filter($"left.id" === $"right.id").show
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+

Am I missing

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Georg Heiler
I do not know that module, but in the literature, PUL (positive-unlabeled learning) is the exact term you should look for. Matt Hicks wrote on Mon, Jan 15, 2018 at 20:56: > Is it fair to assume this is what I need? > https://github.com/ispras/pu4spark > > > > On Mon, Jan 15, 2018 1:55 PM, Georg Heiler

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
Is it fair to assume this is what I need? https://github.com/ispras/pu4spark On Mon, Jan 15, 2018 1:55 PM, Georg Heiler georg.kf.hei...@gmail.com wrote: As far as I know spark does not implement such algorithms. In case the dataset is small

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Georg Heiler
As far as I know, Spark does not implement such algorithms. In case the dataset is small, http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html might be of interest to you. Jörn Franke wrote on Mon, Jan 15, 2018 at 20:04: > I think you look

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Jörn Franke
I think you are looking more for algorithms for unsupervised learning, e.g. clustering. Depending on the characteristics, different clusters might be created, e.g. donor or non-donor. Most likely you may also find more clusters (e.g. would donate but has a disease preventing it, or is too old). You can verify
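
A minimal sketch of that clustering approach with Spark ML (the contacts DataFrame and its numeric feature columns are hypothetical):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // assemble the numeric contact attributes into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "income", "pastGifts"))  // hypothetical columns
      .setOutputCol("features")
    val features = assembler.transform(contacts)  // contacts: hypothetical DataFrame

    // cluster into two groups and inspect which cluster each contact lands in
    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    model.transform(features).select("prediction").show()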

Re: 3rd party hadoop input formats for EDI formats

2018-01-15 Thread Jörn Franke
I do not want to make advertisement for certain third party components. Hence, just some food for thought: Python Pandas supports some of those formats (it is not an InputFormat though). Some commercial offerings just provide ETL to convert it into another format already supported by Spark. Then

[Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
I'm attempting to create a training classification, but I only have positive examples. Specifically, in this case it is a donor list of users, but I want to use it as training data in order to determine classification for new contacts, to give probabilities that they will donate. Any insights or links

3rd party hadoop input formats for EDI formats

2018-01-15 Thread Saravanan Nagarajan
Hello All, Need to research the availability of both open source and commercial libraries to read healthcare EDI formats such as HL7, 835, 837. Each library needs to be researched/ranked on several criteria like pricing if commercial, suitability for integration into sagacity, stability of

Re: End of Stream errors in shuffle

2018-01-15 Thread pratyush04
Hi Fernando, There is a 2 GB limit on blocks for shuffle; since you say the job fails while shuffling 200 GB of data, it might be due to this. These links give more detail: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-2GB-limit-for-partitions-td10435.html
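
If that is the cause, a sketch of the usual workaround (the partition count below is illustrative, not tuned): raise the partition count so each shuffle block stays well under 2 GB:

    // with ~200 GB shuffled, 800 partitions gives roughly 250 MB each
    spark.conf.set("spark.sql.shuffle.partitions", "800")
    // or explicitly ahead of a wide operation (df is the DataFrame being shuffled)
    val repartitioned = df.repartition(800)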

Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
Hi Jacek & Gengliang, let's take a look at the following query:

val pos = spark.read.parquet(prefix + "POSITION.parquet")
pos.createOrReplaceTempView("POSITION")
spark.sql("SELECT POSITION.POSITION_ID FROM POSITION POSITION JOIN POSITION POSITION1 ON POSITION.POSITION_ID0 =

[Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-15 Thread abdul.h.hussain
Hi, My Spark app is mapping lines from a text file to case classes stored within an RDD. When I run the following code on this rdd: .collect.map(line => if(validate_hostname(line, data_frame)) line).foreach(println) It correctly calls the method validate_hostname by passing the case class and
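
If the failure appears once the call runs inside an executor-side closure, a common fix pattern, sketched here with hypothetical names (the hostname column and the hostname field on the case class): a DataFrame cannot be referenced inside functions shipped to executors, so materialize the lookup side on the driver and broadcast it:

    // collect the small lookup DataFrame once on the driver...
    val validHosts = data_frame.select("hostname").collect().map(_.getString(0)).toSet
    // ...broadcast the plain Set, which is safe to use inside closures
    val validHostsB = spark.sparkContext.broadcast(validHosts)
    rdd.filter(line => validHostsB.value.contains(line.hostname)).foreach(println)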

End of Stream errors in shuffle

2018-01-15 Thread Fernando Pereira
Hi, I'm facing a very strange error that occurs halfway through long-running Spark SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at

Re: Inner join with the table itself

2018-01-15 Thread Gengliang Wang
Hi Michael, You can use `Explain` to see how your query is optimized. https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html I believe your query is an actual cross join, which is usually
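
A quick sketch of inspecting the plan for the self-join in question (the join condition below is hypothetical; pos is the DataFrame from the earlier message):

    import spark.implicits._

    // prints the parsed, analyzed, optimized, and physical plans; if the join
    // condition disappears during optimization, the physical plan will show a
    // cartesian/cross join
    val joined = pos.as("p1").join(pos.as("p2"), $"p1.POSITION_ID" === $"p2.POSITION_ID")
    joined.explain(true)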

Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael, -dev +user What's the query? How do you "fool spark"? Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams