The exception is telling you precisely what is wrong. The Kafka source has a
schema of (topic, partition, offset, key, value, timestamp, timestampType).
Nothing about those columns makes sense as a tweet. You need to tell Spark
how to get from bytes to a tweet; it doesn't know how you serialized the data.
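In Spark this is typically done by casting `value` to a string and applying `from_json` with your tweet schema. A minimal plain-Python sketch of the per-record transformation Spark has to perform (the tweet field names here are assumptions for illustration, not from the original thread):

```python
import json

# Sketch of what the deserialization step must do: the Kafka `value`
# column is raw bytes, so it has to be decoded and parsed before any of
# it looks like a tweet. The fields (id, text, user) are assumed.
def bytes_to_tweet(value: bytes) -> dict:
    record = json.loads(value.decode("utf-8"))
    return {"id": record["id"], "text": record["text"], "user": record["user"]}

raw = b'{"id": 1, "text": "hello", "user": "alice"}'
tweet = bytes_to_tweet(raw)
```

In Spark itself the equivalent is selecting `from_json(col("value").cast("string"), schema)` with a schema describing your tweet.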
The default spark.sql.streaming.commitProtocolClass is
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala
which may or may not be best suited to all needs.
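If the default doesn't fit, the class can be swapped via configuration; a sketch (the replacement class name is a hypothetical placeholder you would supply yourself):

```
# spark-defaults.conf (or --conf on spark-submit); com.example.MyCommitProtocol is hypothetical
spark.sql.streaming.commitProtocolClass  com.example.MyCommitProtocol
```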
Code deploys could be improved by ensuring
1. Would it not be more natural to write the processed data to Kafka, and then
sink it from Kafka to S3?
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by
StreamExecution.
3. triggerExecution is effectively the end-to-end time for the trigger.
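These timings surface under `durationMs` in a query's progress events. A small sketch of the relationship, with invented numbers for illustration:

```python
# Hypothetical durationMs map from a streaming query progress event
# (all values in milliseconds; the numbers are made up).
duration_ms = {
    "getBatch": 120,           # time Source#getBatch took (2b)
    "addBatch": 850,           # time Sink#addBatch took (2a)
    "triggerExecution": 1060,  # effectively end-to-end for the trigger (3)
}

# triggerExecution covers the whole trigger, so it bounds the component times;
# the remainder is planning, offset management, and other overhead.
slack = duration_ms["triggerExecution"] - (
    duration_ms["getBatch"] + duration_ms["addBatch"]
)
```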
You can leverage dynamic resource allocation with Structured Streaming.
Certainly there's an argument that trivial jobs won't benefit. Certainly
there's an argument that important jobs should have fixed resources for
stable end-to-end latency.
A few scenarios come to mind with benefits:
- I want my
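For reference, a minimal configuration sketch to enable dynamic allocation (the executor bounds are example values, not recommendations):

```
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     20
```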
Hello,
I am new to Apache Spark. I need to process different time-series data
(numeric values that depend on time) and react to the following conditions:
1. Data is changing up or down too fast.
2. Data is changing constantly up or down too long.
For example, if the data have changed 30% up or down in the
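Both conditions can be checked with simple sliding comparisons over consecutive points. A minimal plain-Python sketch (the 30% threshold comes from the question; the run length is an example value):

```python
def too_fast(values, pct=0.30):
    """Condition 1: any consecutive change exceeds pct (e.g. 30%) up or down."""
    return any(a != 0 and abs(b - a) / abs(a) > pct
               for a, b in zip(values, values[1:]))

def too_long(values, max_run=3):
    """Condition 2: values move in the same direction for more than max_run steps."""
    run, last = 0, 0
    for a, b in zip(values, values[1:]):
        d = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        run = run + 1 if d != 0 and d == last else (1 if d != 0 else 0)
        last = d
        if run > max_run:
            return True
    return False
```

In Spark this logic would run inside a window function or a streaming aggregation; the sketch only shows the per-series check.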
Hi All,
I am not sure if the users list is the right list for this query, but I am
hoping that if this is the wrong forum, someone will point me to the right one.
I work for a company which uses proprietary analytical ecosystem. I am
evangelising open-source and have been requested by management to
Resolved :)
Hi, just looping back on this (thanks for everyone's help).
In jupyter notebook the following command works and properly loads in the
Kafka jar files.
# Spin up a local Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages',
            'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()  # package version must match your Spark/Scala build
I'm running a Jupyter-Spark setup and I want to benchmark my cluster with
different input parameters. To make sure the environment stays the same, I'm
trying to reset (restart) the SparkContext. Here is some code:
temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
Hi, does anyone know how to fix the below error?
java.lang.NoSuchMethodError:
Thanks Michael – this is a great article… very helpful
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, August 23, 2017 4:33 PM
To: JG Perrin
Cc: user@spark.apache.org
Subject: Re: Joining 2 dataframes, getting result as nested list/structure in
I have written a unit test which uses multithreading to start and stop a
Spark Streaming job and a Kafka producer. All the dependencies have been
declared in the Maven pom.xml file.
When I run the test, once all the Kafka messages are read and the
threads are stopped, I continue to get the below
Are there any plans to include it in the future releases of Spark ?
Regards,
Aviral Agarwal
On Thu, Aug 24, 2017 at 3:11 PM, Akhil Das wrote:
> How are you reading the data? Its clearly saying
> *java.lang.NumberFormatException:
> For input string: "0645253_0001" *
>
> On
Hi Sea,
Could you let us know which ML algorithm you use? What's the number of
instances and the dimension of your dataset?
AFAIK, Spark MLlib can train a model with several million features if you
configure it correctly.
Thanks
Yanbo
On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet
Usually analysts will not have access to data stored in the PCI zone; you
could write the data out to a table for the analysts by masking the
sensitive information.
Eg:
> val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7))
> val df = sc.parallelize(Seq(("user1",
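The same masking idea, sketched in plain Python for clarity. This variant keeps only the last four characters visible (a common convention; the Scala udf above instead overwrites the first seven characters with twelve asterisks), and the card number is a hypothetical example:

```python
def mask(info: str, keep_last: int = 4) -> str:
    """Replace everything except the last keep_last characters with asterisks."""
    if len(info) <= keep_last:
        return info
    return "*" * (len(info) - keep_last) + info[-keep_last:]

masked = mask("4111111111111111")  # hypothetical card number
```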
How many iterations are you doing on the data? Like Jörn said, you don't
necessarily need a billion samples for linear regression.
On Tue, Aug 22, 2017 at 6:28 PM, Sea aj wrote:
> Jorn,
>
> My question is not about the model type but instead, the spark capability
> on reusing
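To make the "you don't necessarily need a billion samples" point concrete, a plain-Python sketch fitting simple least squares on a modest synthetic sample; even with noise, a thousand points recover the coefficients closely (data and seed are invented for illustration):

```python
import random

random.seed(0)
# Synthetic data: y = 3x + 5 plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [3 * x + 5 + random.gauss(0, 0.5) for x in xs]

# Closed-form simple linear regression (ordinary least squares).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
```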
How are you reading the data? Its clearly saying
*java.lang.NumberFormatException:
For input string: "0645253_0001" *
On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal
wrote:
> Hi,
>
> I am trying to read hive orc transaction table through Spark but I am
> getting the
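The NumberFormatException means a string column contains values like "0645253_0001" that cannot be parsed as numbers. A plain-Python sketch of screening such values before casting (the sample list is illustrative; in Spark you would filter or clean the column instead):

```python
# Sample values; the first mirrors the string from the exception message.
ids = ["0645253_0001", "0645253", "12345"]

# Keep only values made entirely of digits, which are safe to cast to int.
numeric_ids = [s for s in ids if s.isdigit()]
```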
Have you tried setting spark.executor.instances (currently 0) to a positive
non-zero value? Also, since it's a streaming application, set executor cores > 1.
On Wed, Aug 23, 2017 at 3:38 AM, Karthik Palaniappan wrote:
> I ran the HdfsWordCount example using this command:
>
>
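A configuration sketch of that suggestion (the values are examples, not recommendations):

```
spark.executor.instances  2
spark.executor.cores      2
```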