Using Apache Kylin as data source for Spark

2018-05-17 Thread ShaoFeng Shi
Hello, Kylin and Spark users,

A new doc has been added to the Apache Kylin website on how to use Kylin as a
data source in Spark.
It can help users who want to use Spark to analyze the aggregated
Cube data.

https://kylin.apache.org/docs23/tutorial/spark.html
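
(For a quick impression, below is a minimal sketch of reading a Kylin Cube
from Spark through the JDBC data source, roughly what the tutorial walks
through; the host, project, table, column, and credentials are placeholders
based on Kylin's sample project, not copied from the doc.)

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:kylin://<kylin-host>:7070/<project>")
  .option("driver", "org.apache.kylin.jdbc.Driver")
  .option("dbtable", "kylin_sales")          // a fact table backed by a built Cube
  .option("user", "ADMIN")
  .option("password", "KYLIN")
  .load()

// Aggregations like this are answered from the pre-aggregated Cube data.
df.groupBy("part_dt").count().show()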

Thanks for your attention.

-- 
Best regards,

Shaofeng Shi 史少锋


Re: Continuous Processing mode behaves differently from Batch mode

2018-05-17 Thread Yuta Morisawa

Thank you for your reply.

I checked the Web UI and found that the total number of tasks is 10.
So I changed the number of cores from 1 to 10, and then it worked well.

But I haven't figured out what is happening.

My assumption is that each job consists of 10 tasks by default and each 
task occupies 1 core.

So, in my case, assigning only 1 core caused the issue.
In other words, Continuous mode needs at least 10 cores.

Is it right?
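
(For context, a minimal sketch of the kind of continuous query in question,
loosely following the programming-guide example; the Kafka servers, topic,
and trigger interval are placeholders. With the continuous trigger every task
is long-running and holds a core for the lifetime of the query, which is why
the number of available cores has to be at least the number of tasks.)

import org.apache.spark.sql.streaming.Trigger

val query = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))  // drop this line to run in micro-batch mode
  .start()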


Regards;
Yuta

On 2018/05/16 15:24, Shixiong(Ryan) Zhu wrote:
One possible cause is that you don't have enough resources to launch all tasks 
for your continuous processing query. Could you check the Spark UI and 
see if all tasks are running rather than waiting for resources?


Best Regards,

Shixiong Zhu
Databricks Inc.
shixi...@databricks.com 

http://databricks.com





On Tue, May 15, 2018 at 5:38 PM, Yuta Morisawa
<yu-moris...@kddi-research.jp> wrote:


Hi all

Now I am using Structured Streaming in Continuous Processing mode
and I faced an odd problem.

My code is simple and similar to the sample code in the
documentation.

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing




When I send the same text data ten times, for example a 10-line text,
in Batch mode the result has 100 lines.

But in Continuous Processing mode the result has only 10 lines.
It appears that duplicated lines are removed.

The only difference between the two programs is the presence or absence
of the trigger method.

Why do these two programs behave differently?


--
Regard,
Yuta





Getting Data From Hbase using Spark is Extremely Slow

2018-05-17 Thread SparkUser6
I have written a simple four-line Spark program to process data in a Phoenix
table:
queryString = getQueryFullString();  // Get data from the Phoenix table: select
col from table


JavaPairRDD<NullWritable, TestWritable> phRDD = jsc.newAPIHadoopRDD(
        configuration,
        PhoenixInputFormat.class,
        NullWritable.class,
        TestWritable.class);

JavaRDD<Long> rdd = phRDD.map(new Function<Tuple2<NullWritable, TestWritable>, Long>() {
    @Override  // Goal is to scan all the data
    public Long call(Tuple2<NullWritable, TestWritable> tuple) throws Exception {
        return 1L;
    }
});

System.out.println(rdd.count());

This program takes 2 hours to process 2 million records. Can anyone help
me understand what is wrong?






[structured-streaming] foreachPartition alternative in structured streaming.

2018-05-17 Thread karthikjay
I am reading data from Kafka using Structured Streaming and I need to save
the data to InfluxDB. In the regular DStreams-based approach I did this as
follows:

  val messages: DStream[(String, String)] = kafkaStream.map(record =>
    (record.topic, record.value))
  messages.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val influxService = new InfluxService()
      val connection = influxService.createInfluxDBConnectionWithParams(
        host,
        port,
        username,
        password,
        database
      )
      partitionOfRecords.foreach(record =>
        ABCService.handleData(connection, record._1, record._2)
      )
    }
  }
  ssc.start()
  logger.info("Started Spark-Kafka streaming session")
  ssc.awaitTermination()

Note: I create the connection object inside foreachPartition. How do I do this
in Structured Streaming? I tried a connection pooling approach (where I
create a pool of connections on the master node and pass it to the worker
nodes) here, and the workers could not get the connection pool object.
Anything obvious that I am missing here?
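
For what it's worth, a minimal sketch of the ForeachWriter approach in
Structured Streaming, which plays the role foreachPartition did for DStreams.
This is only a sketch under assumptions: InfluxService, ABCService, the
connection type, and the connection parameters are the names from the snippet
above, reused here as placeholders.

import org.apache.spark.sql.ForeachWriter

// Sketch only: the connection is created in open(), which runs once per
// partition on the executor, mirroring the foreachPartition pattern above.
class InfluxForeachWriter(host: String, port: Int, username: String,
                          password: String, database: String)
    extends ForeachWriter[(String, String)] {

  private var connection: InfluxDBConnection = _  // hypothetical connection type

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = new InfluxService().createInfluxDBConnectionWithParams(
      host, port, username, password, database)
    true
  }

  override def process(record: (String, String)): Unit =
    ABCService.handleData(connection, record._1, record._2)

  override def close(errorOrNull: Throwable): Unit = {
    // release the connection here if the client exposes a close/disconnect call
  }
}

// usage (df is the Dataset[(String, String)] built from the Kafka source):
// df.writeStream
//   .foreach(new InfluxForeachWriter(host, port, username, password, database))
//   .start()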






Re: Snappy file compatible problem with spark

2018-05-17 Thread JF Chen
Yes. The JSON files compressed by Flume or Spark work well with Spark. But
the JSON files I compressed myself cannot be read by Spark due to a codec
problem. It seems Spark can only read files compressed with hadoop-snappy
(https://code.google.com/archive/p/hadoop-snappy/).
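
As a rough workaround, one could decompress the snappy-java output back to
plain JSON and let Spark read that; a minimal sketch, assuming the files were
written with snappy-java's SnappyOutputStream framing and with placeholder
paths:

import java.io.{FileInputStream, FileOutputStream}
import org.xerial.snappy.SnappyInputStream

// Sketch only: stream-decompress one snappy-java file to a plain file.
def decompressXerialSnappy(in: String, out: String): Unit = {
  val input  = new SnappyInputStream(new FileInputStream(in))
  val output = new FileOutputStream(out)
  try {
    val buffer = new Array[Byte](8192)
    Iterator.continually(input.read(buffer))
      .takeWhile(_ != -1)
      .foreach(n => output.write(buffer, 0, n))
  } finally {
    input.close()
    output.close()
  }
}

// Afterwards the plain file can be read as usual:
// spark.read.json("path/to/decompressed.json")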


Regard,
Junfeng Chen

On Thu, May 17, 2018 at 5:47 PM, Victor Noagbodji <
vnoagbo...@amplify-nation.com> wrote:

> Hey, sorry if I misunderstood. Are you feeding the compressed JSON file to
> Spark directly?
>
> On May 17, 2018, at 4:59 AM, JF Chen  wrote:
>
> I made some snappy-compressed JSON files with the normal snappy codec
> ( https://github.com/xerial/snappy-java ), which it seems cannot be read by
> Spark correctly.
> So how can I make the existing snappy files recognized by Spark? Any tools to
> convert them?
>
> Thanks!
>
> Regard,
> Junfeng Chen
>
>
>


Snappy file compatible problem with spark

2018-05-17 Thread JF Chen
I made some snappy-compressed JSON files with the normal snappy codec
( https://github.com/xerial/snappy-java ), which it seems cannot be read by
Spark correctly.
So how can I make the existing snappy files recognized by Spark? Any tools to
convert them?

Thanks!

Regard,
Junfeng Chen


Spark Jobs ends when assignment not found for Kafka Partition

2018-05-17 Thread Biplob Biswas
Hi,

I am having a peculiar problem with our Spark jobs in our cluster, where
the Spark job ends with the message:

No current assignment for partition iomkafkaconnector-deliverydata-dev-2


We have a setup with 4 Kafka partitions and 4 Spark executors, so each
partition should be read directly by one executor. Also, the
interesting thing to note is that we haven't set
'spark.dynamicAllocation.enabled' to true, but the executors still become
dead and new executors are added at times.

The problem I see is that when executors go down, the partitions are not
reassigned to active executors. I can't say why this is happening, but I
have the following log, which I receive right before the Spark job dies.



18/05/12 03:52:52 INFO internals.ConsumerCoordinator: Setting newly
assigned partitions [hierarchy-updates-dev-0,
iomkafkaconnector-deliverydata-dev-0,
iomkafkaconnector-deliverydata-dev-1] for group
data-in-cleansing-consumer
18/05/12 03:53:01 ERROR scheduler.JobScheduler: Error generating jobs
for time 152607561 ms
java.lang.IllegalStateException: No current assignment for partition
iomkafkaconnector-deliverydata-dev-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:251)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:315)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1170)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:194)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:211)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/05/12 03:53:01 ERROR yarn.ApplicationMaster: User class threw
exception: java.lang.IllegalStateException: No current assignment for
partition iomkafkaconnector-deliverydata-dev-2
java.lang.IllegalStateException: No current assignment for partition
iomkafkaconnector-deliverydata-dev-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:251)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:315)
at 
org.apache.kafka.clients.consumer.KafkaConsu