Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
Would it not be like appending lines to the same file in that case? On Tue, Jan 16, 2018 at 4:50 AM, kant kodali wrote: > Got it! What about overwriting the same file instead of appending? > > On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta < > gourav.sengu...@gmail.com>

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread ayan guha
http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html#input-sources On Tue, Jan 16, 2018 at 3:50 PM, kant kodali wrote: > Got it! What about overwriting the same file instead of appending? > > On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta < >

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Got it! What about overwriting the same file instead of appending? On Mon, Jan 15, 2018 at 7:47 PM, Gourav Sengupta wrote: > What Gerard means is that if you are adding new files in to the same base > path (key) then its fine, but in case you are appending lines to

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
What Gerard means is that if you are adding new files into the same base path (key) then it's fine, but in case you are appending lines to the same file then changes will not be picked up. Regards, Gourav Sengupta On Tue, Jan 16, 2018 at 12:20 AM, kant kodali wrote: > Hi, >
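
To make that concrete, a minimal sketch (assuming Spark 2.2 Structured Streaming; the HDFS paths are hypothetical) of monitoring a directory as a source, where only whole files atomically moved in are picked up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("file-source-sketch").getOrCreate()

    // Each new file atomically moved into the directory becomes new input rows;
    // lines appended to an already-seen file are NOT detected.
    val lines = spark.readStream.textFile("hdfs:///data/incoming")  // hypothetical path

    val query = lines.writeStream.format("console").start()
    query.awaitTermination()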

Re: spark-submit can find python?

2018-01-15 Thread Jeff Zhang
Hi Manuel, Looks like you are using the virtualenv support of Spark. Virtualenv will create a Python environment on each executor. >>> --conf >>> spark.pyspark.virtualenv.bin.path=/home/mansop/hail-test/python-2.7.2/bin/activate \ And you are not setting the configuration properly: spark.pyspark.virtualenv.bin.path
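
A common alternative, sketched here under the assumption that the virtualenv already exists at the same path on every node, is to point Spark at the environment's interpreter directly (the interpreter path below is derived from the activate path in the question; the script name is hypothetical):

    spark-submit --master yarn \
      --deploy-mode cluster \
      --conf spark.pyspark.python=/home/mansop/hail-test/python-2.7.2/bin/python \
      my_script.py

spark.pyspark.python is a standard Spark 2.1+ setting that selects the Python binary used for PySpark on the driver and executors.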

Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting. Can you include the steps to reproduce this bug? On Tue, Jan 16, 2018 at 7:07 AM, Ted Yu wrote: > Did you include any picture ? > > Looks like the picture didn't go thru. > > Please use third party site. > > Thanks > > Original message

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi, I am not sure I understand. Any examples? On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas wrote: > Hi, > > You can monitor a filesystem directory as streaming source as long as the > files placed there are atomically copied/moved into the directory. > Updating the

RE: spark-submit can find python?

2018-01-15 Thread Manuel Sopena Ballesteros
Apologies, I copied the wrong spark-submit output from running in a cluster. Please find below the right output for the question asked:

-bash-4.1$ spark-submit --master yarn \
> --deploy-mode cluster \
> --driver-memory 4g \
> --executor-memory 2g \
> --executor-cores 4 \
>

spark-submit can find python?

2018-01-15 Thread Manuel Sopena Ballesteros
Hi all, I am quite new to Spark and need some help troubleshooting the execution of an application running on a Spark cluster... My Spark environment is deployed using Ambari (HDP), YARN is the resource scheduler, and Hadoop (HDFS) is the file system. The application I am trying to run is a python script

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gerard Maas
Hi, You can monitor a filesystem directory as a streaming source as long as the files placed there are atomically copied/moved into the directory. Updating the files is not supported. kr, Gerard. On Mon, Jan 15, 2018 at 11:41 PM, kant kodali wrote: > Hi All, > > I am

Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture? Looks like the picture didn't go through. Please use a third party site. Thanks. Original message From: Tomasz Gawęda Date: 1/15/18 2:07 PM (GMT-08:00) To: d...@spark.apache.org, user@spark.apache.org Subject: Broken SQL

can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi All, I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0? For example, can I have stream1 reading from Kafka and writing to HDFS, and stream2 reading from HDFS and writing back to Kafka, such that stream2 will be pulling the latest updates written by stream1? Thanks!
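
For reference against the answers further down the thread: a rough sketch of the two-stage pipeline being asked about (broker address, topic names, paths, and schema are all hypothetical), which works because Spark's file sink writes complete new files per trigger and the file source of a second query then picks each one up:

    import org.apache.spark.sql.types._

    // stream1: Kafka -> HDFS (file sink writes a new file per trigger)
    val stream1 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "in-topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///tmp/stage")
      .option("checkpointLocation", "hdfs:///tmp/ckpt1")
      .start()

    // stream2: HDFS -> Kafka; each file completed by stream1 becomes new input
    val schema = new StructType().add("value", StringType)
    val stream2 = spark.readStream
      .schema(schema)  // file sources require an explicit schema
      .parquet("hdfs:///tmp/stage")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "out-topic")
      .option("checkpointLocation", "hdfs:///tmp/ckpt2")
      .start()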

Re: Timestamp changing while writing

2018-01-15 Thread Bryan Cutler
Spark internally stores timestamps as UTC values, so createDataFrame will convert from the local time zone to UTC. I think there was a Jira to correct parquet output. Are the values you are seeing offset from your local time zone? On Jan 11, 2018 4:49 PM, "sk skk" wrote: >
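
A small sketch of the behaviour Bryan describes (assuming Spark 2.2+, where the session time zone setting is available):

    // Spark stores timestamps internally as UTC; how they are displayed
    // (and parsed from strings) depends on the session time zone.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    import java.sql.Timestamp
    // Timestamp.valueOf interprets the string in the JVM's default time zone
    val df = spark.createDataFrame(Seq(Tuple1(Timestamp.valueOf("2018-01-15 12:00:00"))))
    df.show()  // rendered in the session time zone set above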

Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael,

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> val r1 = spark.range(1)
r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> r1.as("left").join(r1.as("right")).filter($"left.id" === $"right.id").show
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+

Am I missing

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Georg Heiler
I do not know that module, but in the literature, PUL (positive-unlabeled learning) is the exact term you should look for. Matt Hicks wrote on Mon, Jan 15, 2018 at 20:56: > Is it fair to assume this is what I need? > https://github.com/ispras/pu4spark > > > > On Mon, Jan 15, 2018 1:55 PM, Georg Heiler

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
Is it fair to assume this is what I need? https://github.com/ispras/pu4spark On Mon, Jan 15, 2018 1:55 PM, Georg Heiler georg.kf.hei...@gmail.com wrote: As far as I know spark does not implement such algorithms. In case the dataset is small

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Georg Heiler
As far as I know, Spark does not implement such algorithms. In case the dataset is small, http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html might be of interest to you. Jörn Franke wrote on Mon, Jan 15, 2018 at 20:04: > I think you look

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Jörn Franke
I think you are looking more for algorithms for unsupervised learning, e.g. clustering. Depending on the characteristics, different clusters might be created, e.g. donor or non-donor. Most likely you may also find more clusters (e.g. would donate but has a disease preventing it, or is too old). You can verify
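
A minimal sketch of that clustering approach with Spark ML (the contacts DataFrame and its numeric feature columns are hypothetical):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // assemble the numeric contact attributes into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "income", "pastGifts"))  // hypothetical columns
      .setOutputCol("features")
    val features = assembler.transform(contacts)  // contacts: hypothetical DataFrame

    // cluster into two groups and inspect which cluster each contact lands in
    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    model.transform(features).select("prediction").show()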

Re: 3rd party hadoop input formats for EDI formats

2018-01-15 Thread Jörn Franke
I do not want to make advertisement for certain third party components. Hence, just some food for thought: Python Pandas supports some of those formats (it is not an InputFormat though). Some commercial offerings just provide ETL to convert it into another format already supported by Spark. Then

[Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Matt Hicks
I'm attempting to create a training classification, but I only have positive examples. Specifically, in this case it is a donor list of users, but I want to use it as training data in order to determine classification for new contacts, to give probabilities that they will donate. Any insights or links

3rd party hadoop input formats for EDI formats

2018-01-15 Thread Saravanan Nagarajan
Hello All, Need to research the availability of both open source and commercial libraries to read healthcare EDI formats such as HL7, 835, 837. Each library needs to be researched/ranked on several criteria like pricing if commercial, suitability for integration into sagacity, stability of

Re: End of Stream errors in shuffle

2018-01-15 Thread pratyush04
Hi Fernando, There is a 2 GB limit on blocks for shuffle; since you say the job fails while shuffling 200 GB of data, it might be due to this. These links give more detail: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-2GB-limit-for-partitions-td10435.html
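
If that is the cause, a sketch of the usual workaround (the partition count below is illustrative, not tuned): raise the partition count so each shuffle block stays well under 2 GB:

    // with ~200 GB shuffled, 800 partitions gives roughly 250 MB each
    spark.conf.set("spark.sql.shuffle.partitions", "800")
    // or explicitly ahead of a wide operation (df is the DataFrame being shuffled)
    val repartitioned = df.repartition(800)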

Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
Hi Jacek & Gengliang, let's take a look at the following query:

val pos = spark.read.parquet(prefix + "POSITION.parquet")
pos.createOrReplaceTempView("POSITION")
spark.sql("SELECT POSITION.POSITION_ID FROM POSITION POSITION JOIN POSITION POSITION1 ON POSITION.POSITION_ID0 =

[Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-15 Thread abdul.h.hussain
Hi, My Spark app is mapping lines from a text file to case classes stored within an RDD. When I run the following code on this rdd: .collect.map(line => if(validate_hostname(line, data_frame)) line).foreach(println) It correctly calls the method validate_hostname by passing the case class and
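
If the failure appears once the call runs inside an executor-side closure, a common fix pattern, sketched here with hypothetical names (the hostname column and the hostname field on the case class): a DataFrame cannot be referenced inside functions shipped to executors, so materialize the lookup side on the driver and broadcast it:

    // collect the small lookup DataFrame once on the driver...
    val validHosts = data_frame.select("hostname").collect().map(_.getString(0)).toSet
    // ...broadcast the plain Set, which is safe to use inside closures
    val validHostsB = spark.sparkContext.broadcast(validHosts)
    rdd.filter(line => validHostsB.value.contains(line.hostname)).foreach(println)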

End of Stream errors in shuffle

2018-01-15 Thread Fernando Pereira
Hi, I'm facing a very strange error that occurs halfway through long-running Spark SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at

Re: Inner join with the table itself

2018-01-15 Thread Gengliang Wang
Hi Michael, You can use `Explain` to see how your query is optimized. https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html I believe your query is an actual cross join, which is usually
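
A quick sketch of inspecting the plan for the self-join in question (the join condition below is hypothetical; pos is the DataFrame from the earlier message):

    import spark.implicits._

    // prints the parsed, analyzed, optimized, and physical plans; if the join
    // condition disappears during optimization, the physical plan will show a
    // cartesian/cross join
    val joined = pos.as("p1").join(pos.as("p2"), $"p1.POSITION_ID" === $"p2.POSITION_ID")
    joined.explain(true)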

Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael, -dev +user What's the query? How do you "fool spark"? Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams