Re: ORC Transaction Table - Spark

2017-08-24 Thread Aviral Agarwal
Are there any plans to include it in future releases of Spark? Regards, Aviral Agarwal On Thu, Aug 24, 2017 at 3:11 PM, Akhil Das wrote: > How are you reading the data? It's clearly saying > *java.lang.NumberFormatException: > For input string: "0645253_0001" * > > On

SparkStreaming connection exception

2017-08-24 Thread Likith_Kailas
I have written a unit test which uses multithreading to start and stop the Spark streaming job and the Kafka producer. All the dependencies have been declared in the Maven pom.xml file. When I run the test, once all the Kafka messages are read and the threads are stopped, I continue to get the below

RE: Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-24 Thread JG Perrin
Thanks Michael – this is a great article… very helpful From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, August 23, 2017 4:33 PM To: JG Perrin Cc: user@spark.apache.org Subject: Re: Joining 2 dataframes, getting result as nested list/structure in

Restarting the SparkContext in pyspark

2017-08-24 Thread Alexander Czech
I'm running a Jupyter-Spark setup and I want to benchmark my cluster with different input parameters. To make sure the environment stays the same, I'm trying to reset (restart) the SparkContext. Here is some code: *temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
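One way to get a clean context between benchmark runs is to stop the existing session and rebuild it with fresh settings. A minimal sketch (assuming a local Jupyter setup; the helper name and the example config key/value are hypothetical), essentially a configuration pattern rather than a full program:

```python
from pyspark.sql import SparkSession

def restart_spark(spark, app_name, conf):
    """Stop the given session (and its SparkContext), then build a new one.

    `conf` is a dict of Spark settings to vary between benchmark runs.
    """
    if spark is not None:
        spark.stop()  # also stops the underlying SparkContext
    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# e.g. one fresh session per parameter set (example setting only)
spark = restart_spark(None, "benchmark", {"spark.sql.shuffle.partitions": "64"})
```

Note that some Spark versions cache the instantiated session, so it is worth verifying after each restart that the new configuration actually took effect.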

Re: How is data desensitization (example: select bank_no from users)?

2017-08-24 Thread Akhil Das
Usually analysts will not have access to data stored in the PCI zone; you could write the data out to a table for the analysts by masking the sensitive information. E.g.: > val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7)) > val df = sc.parallelize(Seq(("user1",
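The Scala snippet above replaces the first seven characters of a field with twelve asterisks via `String.patch`. A rough plain-Python equivalent that could be wrapped in a PySpark UDF (the field widths are the example's figures, not a standard):

```python
def mask(info, hidden=7, pad=12):
    """Replace the first `hidden` characters with `pad` asterisks,
    mirroring Scala's info.patch(0, "*" * 12, 7)."""
    return "*" * pad + info[hidden:]

# e.g. mask("4111111111111111") -> "************111111111"
```

In PySpark this would be registered with `udf(mask)` and applied to the sensitive column before writing the analysts' table.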

NoSuchMethodError CatalogTable.copy

2017-08-24 Thread Lionel Luffy
Hi, does anyone know how to fix the error below? java.lang.NoSuchMethodError:

Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea, Could you let us know which ML algorithm you use? What are the number of instances and the dimension of your dataset? AFAIK, Spark MLlib can train a model with several million features if you configure it correctly. Thanks, Yanbo On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet

Re: PySpark, Structured Streaming and Kafka

2017-08-24 Thread Brian Wylie
Resolved :) Hi, just a loopback on this (thanks for everyone's help). In a Jupyter notebook the following command works and properly loads the Kafka jar files. # Spin up a local Spark Session spark = SparkSession.builder.appName('my_awesome')\ .config('spark.jars.packages',
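For reference, a sketch of the full pattern the truncated snippet describes — a configuration fragment only, and the package coordinates are an assumption (the artifact must match your Spark and Scala versions; `2.11:2.2.0` fits a 2017-era setup):

```python
from pyspark.sql import SparkSession

# Spin up a local Spark session with the Kafka source jar pulled in.
# Adjust the Scala/Spark versions in the coordinate to match your cluster.
spark = (SparkSession.builder
         .appName('my_awesome')
         .config('spark.jars.packages',
                 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')
         .getOrCreate())
```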

SWOT Analysis on Apache Spark

2017-08-24 Thread Irfan Kabli
Hi All, I am not sure if the users list is the right list for this query; if it is the wrong forum, I hope someone will point me to the right one. I work for a company which uses a proprietary analytical ecosystem. I am evangelising open source and have been requested by management to

[Spark] Can Apache Spark be used with time series processing?

2017-08-24 Thread Alexandr Porunov
Hello, I am new to Apache Spark. I need to process different time series data (numeric values which depend on time) and react to the following events: 1. The data changes up or down too fast. 2. The data keeps changing up or down for too long. For example, if the data have changed 30% up or down in the
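Before reaching for streaming machinery, the two conditions can be prototyped on a plain list of samples. A sketch (the 30% threshold and the window length are illustrative figures, not part of any Spark API):

```python
def fast_change(samples, threshold=0.30):
    """Condition 1: the last sample moved more than `threshold`
    (e.g. 30%) relative to the previous one."""
    if len(samples) < 2 or samples[-2] == 0:
        return False
    return abs(samples[-1] - samples[-2]) / abs(samples[-2]) > threshold

def sustained_trend(samples, min_length=5):
    """Condition 2: the series has moved strictly in one direction
    for at least `min_length` consecutive steps."""
    if len(samples) < min_length + 1:
        return False
    tail = samples[-(min_length + 1):]
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
```

The same per-window logic could then be applied inside a Spark Streaming window operation once the thresholds are settled.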

Re: unable to stream kafka messages

2017-08-24 Thread cbowden
The exception is telling you precisely what is wrong. The Kafka source has a schema of (topic, partition, offset, key, value, timestamp, timestampType). Nothing about those columns makes sense as a tweet. You need to inform Spark how to get from bytes to a tweet; it doesn't know how you serialized
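The usual shape of that deserialization step, as a sketch (the broker address, topic name, and `tweet_schema` fields are hypothetical, and the schema must match however the producer serialized the payload — JSON is assumed here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("tweets").getOrCreate()

# Hypothetical schema; it has to mirror the producer's JSON payload.
tweet_schema = StructType([
    StructField("id", StringType()),
    StructField("text", StringType()),
])

tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumption
          .option("subscribe", "tweets")                        # assumption
          .load()
          # the source's value column is binary: cast, then parse
          .select(from_json(col("value").cast("string"),
                            tweet_schema).alias("tweet"))
          .select("tweet.*"))
```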

Re: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-08-24 Thread Akhil Das
Have you tried setting spark.executor.instances to a positive, non-zero value (instead of 0)? Also, since it's a streaming application, set executor cores > 1. On Wed, Aug 23, 2017 at 3:38 AM, Karthik Palaniappan wrote: > I ran the HdfsWordCount example using this command: > >
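The settings under discussion, spelled out as spark-submit flags — a configuration sketch only, with placeholder values (the jar path and input directory are illustrative, and the `spark.streaming.dynamicAllocation.*` keys are the streaming-specific ones an HdfsWordCount experiment on YARN would exercise):

```shell
spark-submit \
  --master yarn \
  --class org.apache.spark.examples.streaming.HdfsWordCount \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.streaming.dynamicAllocation.minExecutors=1 \
  --conf spark.streaming.dynamicAllocation.maxExecutors=10 \
  --conf spark.executor.cores=2 \
  spark-examples.jar /input
```

Note that core dynamic allocation must be disabled for the streaming variant to take over, which is the interaction the thread is probing.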

Re: ORC Transaction Table - Spark

2017-08-24 Thread Akhil Das
How are you reading the data? It's clearly saying *java.lang.NumberFormatException: For input string: "0645253_0001" * On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal wrote: > Hi, > > I am trying to read a Hive ORC transactional table through Spark but I am > getting the

Re: UI for spark machine learning.

2017-08-24 Thread Akhil Das
How many iterations are you doing on the data? Like Jörn said, you don't necessarily need a billion samples for linear regression. On Tue, Aug 22, 2017 at 6:28 PM, Sea aj wrote: > Jörn, > > My question is not about the model type but about Spark's capability > of reusing

Re: [Streaming][Structured Streaming] Understanding dynamic allocation in streaming jobs

2017-08-24 Thread cbowden
You can leverage dynamic resource allocation with structured streaming. Certainly there's an argument that trivial jobs won't benefit, and that important jobs should have fixed resources for stable end-to-end latency. A few scenarios with benefits come to mind: - I want my

Re: Structured Streaming: multiple sinks

2017-08-24 Thread cbowden
1. Would it not be more natural to write processed data to Kafka and sink it from Kafka to S3? 2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution. 2b. getBatch is the time Source#getBatch took, as measured by StreamExecution. 3. triggerExecution is effectively end-to-end

Re: [Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-24 Thread cbowden
The default spark.sql.streaming.commitProtocolClass is https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala, which may or may not be best suited for all needs. Code deploys could be improved by ensuring